Abstract

Pedigrees are directed acyclic graphs that represent ancestral relationships be- 
tween individuals in a population. Based on a schematic recombination process, 
we describe two simple Markov models for sequences evolving on pedigrees - Model 
R (recombinations without mutations) and Model RM (recombinations with muta- 
tions). For these models, we ask an identifiability question: is it possible to con- 
struct a pedigree from the joint probability distribution of extant sequences? We 
present partial identifiability results for general pedigrees: we show that when the 
crossover probabilities are sufficiently small, certain spanning subgraph sequences 
can be counted from the joint distribution of extant sequences. We demonstrate how 
pedigrees that earlier seemed difficult to distinguish are distinguished by counting 
their spanning subgraph sequences. 

Mathematics Subject Classifications: Primary: 60, 05, Secondary: 05C60,92D 
Keywords: reconstructing pedigrees, identifiability, recombinations, mutations 



1 Introduction 



Phylogenetics is a study of how species are related to each other. Evolutionary relation- 
ships are most conveniently represented by rooted leaf-labelled trees, where the leaves 
represent extant species and the root represents their most recent common ancestral 
species. Similarly other internal vertices of evolutionary trees correspond to extinct an- 
cestral species. 

The arrival of DNA and protein sequence data in the last forty years led to explosive 
growth of phylogenetics. Many of the modern phylogenetic methods consider sequence 
data under probabilistic models of sequence evolution. For such models to be useful for 
phylogenetic inference, it is important to establish their identifiability (i.e., to show that 
nonisomorphic trees or different model parameters cannot induce the same distribution
on the sequences at the leaves under a given model of sequence evolution). The mathematical
theory of phylogenetic trees, especially probabilistic models of sequence evolution and the
associated questions of identifiability and statistical consistency (especially of maximum
likelihood methods), has been extensively studied [12], [5], giving a firm statistical
foundation to the study of phylogenetic trees.

While phylogenetic trees represent relationships between species, population pedigrees 
represent how individuals within a population are related to each other. Communities all 
over the world have long been curious about knowing their ancestral histories, and have 
often kept detailed records of their family trees. In fact this curiosity goes back much 
further in the past than the interest in constructing evolutionary relationships between 
species. An example of a fairly detailed record of family histories is the Icelandic database
Íslendingabók (The Book of Icelanders, http://www.islendingabok.is) of genealogical
records, which covers almost the whole Icelandic population and goes back nearly 1200
years. Such ancestral histories are often compiled from a variety of sources such as church
records, birth and death records, obituaries etc. that are prone to ambiguities or missing 
data beyond a few generations in the past. 

In the last several years large amounts of data on intra-population genetic variation 
have been recorded. For example, the Icelandic biomedical company deCode Genetics has 
compiled genomic, genealogical and health data of more than 100000 individuals (which 
is a significant proportion of the current Icelandic population). Such data offer promising 
opportunity to cross check and resolve ambiguities in historic genealogical records besides 
being useful for other studies such as, for example, genetic factors associated with medical 
conditions. Therefore, there is a renewed interest in accurately inferring pedigrees from 
genomic data. The statistical and combinatorial foundation for studying reconstruction 
problems for pedigrees has not been as developed as in phylogenetics. A purpose of this 
paper is to continue our earlier attempts to develop such a foundation for the problem of 
reconstructing pedigrees from observations (sequences) on extant individuals. 

To develop such a foundation, we need to establish results along the following lines: 
developing a biologically realistic model for sequences undergoing mutations and recom- 
binations, and identifiability results for such a model; statistical consistency results; and 
finally results that give estimates for the amount of genomic data necessary to reliably 
construct pedigrees. This paper is mainly about identifiability questions. 

In the rest of this section, we discuss the above theoretical motivation in more detail. 
We begin by informally sketching some well known reconstruction and identifiability re- 
sults for phylogenetic trees that not only motivate the work in this paper but also are 
crucially useful in the proof of the main identifiability result of this paper. 

Results of Zareckii and Buneman. It was shown in [20] that a leaf-labelled tree can
be uniquely constructed from the pairwise distances between its leaves. The result was 
strengthened slightly as follows [3]. Suppose that $f$ is an additive function on the family
of subsets of cardinality 2 of the vertex set of a leaf-labelled tree. Here additive means
that for any two vertices $r$ and $s$, we have $f(\{r,s\}) = \sum_{e} f(e)$, where the summation is
over all edges $e$ on the (unique) path from $r$ to $s$. Buneman showed that knowing $f$ on all
pairs of leaves of a leaf-labelled tree without vertices of degree 2 is sufficient to uniquely
construct the tree and the function. It is not surprising that these results are quite useful 
in phylogenetics, where observations on extant species (leaves of an evolutionary tree) are 
used to infer a suitably defined distance (or an additive function) between pairs of species, 
and then their phylogenetic tree is constructed uniquely. 
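
As a small illustration of additivity, the following Python sketch computes the values of an additive function on the pairs of leaves of a tree. The tree, its edge weights and all function names are hypothetical and chosen only for this example. The sketch also evaluates the three pairwise sums on four leaves; it is a standard fact about tree metrics (the four-point condition) that the two largest of these sums are equal, and the smallest sum indicates which pairs of leaves are separated by an internal edge.

```python
from itertools import combinations

# Hypothetical leaf-labelled tree: leaves a, b, c, d; internal vertices u, v.
# The edge weights play the role of the additive function f on edges.
EDGES = {("a", "u"): 1.0, ("b", "u"): 2.0, ("u", "v"): 3.0,
         ("c", "v"): 4.0, ("d", "v"): 5.0}


def adjacency(edges):
    """Build a symmetric adjacency map {vertex: {neighbour: weight}}."""
    adj = {}
    for (r, s), w in edges.items():
        adj.setdefault(r, {})[s] = w
        adj.setdefault(s, {})[r] = w
    return adj


def path_sum(adj, start, goal):
    """Sum of edge weights on the unique path from start to goal in a tree."""
    stack = [(start, None, 0.0)]
    while stack:
        vertex, parent, total = stack.pop()
        if vertex == goal:
            return total
        for nxt, w in adj[vertex].items():
            if nxt != parent:
                stack.append((nxt, vertex, total + w))
    raise ValueError("vertices are not connected")


def leaf_distances(edges, leaves):
    """Additive function restricted to pairs of leaves."""
    adj = adjacency(edges)
    return {frozenset(pair): path_sum(adj, *pair)
            for pair in combinations(leaves, 2)}


def four_point_sums(dist, a, b, c, d):
    """The three pairwise sums on four leaves, in increasing order."""
    def f(x, y):
        return dist[frozenset((x, y))]
    return sorted([f(a, b) + f(c, d), f(a, c) + f(b, d), f(a, d) + f(b, c)])


DIST = leaf_distances(EDGES, ["a", "b", "c", "d"])
SUMS = four_point_sums(DIST, "a", "b", "c", "d")
```

In this example the smallest sum is f({a,b}) + f({c,d}), which reflects the fact that the internal edge uv separates {a, b} from {c, d}.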

Results of Steel and Chang. Now suppose the evolutionary process on a (rooted) tree is 
modelled as follows: first the root is assigned a random state from a finite alphabet $\Sigma$
(e.g., $\Sigma$ may be {A,T,G,C} in the case of DNA sequences). Each state is assumed to have
a nonzero probability of being assigned to the root. Each edge of the tree has associated
with it a $|\Sigma| \times |\Sigma|$ matrix of substitution probabilities. These substitution probabilities
determine how the vertex-states evolve away from the root, and induce a distribution of 
states at the leaves of the tree. The model was formulated in [13] as described above, while 
a slightly more general formulation in terms of Markov random fields on unrooted trees 
was given in [3]. It was independently shown by Chang and Steel that when the matrices 
defining the substitution probabilities satisfy certain mild conditions, the (unrooted) tree 
can be uniquely recovered from the joint distribution of states at the leaves of the tree. 
In particular they showed that the negative logarithm of the determinant of the matrix 
of substitution probabilities between pairs of vertices is an additive function on the pairs 
of vertices, and can be computed from the probability distribution on extant sequences. 
It was further proved in [4] that the substitution matrices are also identifiable from the 
marginal distributions on triples of leaves of the tree. Special cases of these results for 
models more commonly used in phylogenetics were known in the phylogenetics literature 
earlier. 
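
The additivity of the 'log determinant' along a path can be checked numerically. The sketch below assumes a two-state alphabet and a directed path r, s, t with hypothetical substitution matrices; by the Markov property the matrix of substitution probabilities from r to t is the product of the matrices on the two edges, and the determinant is multiplicative. The matrices and function names are illustrative only and are not taken from [13] or [4].

```python
import math


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def neg_log_det(m):
    """Negative logarithm of the absolute value of the determinant."""
    return -math.log(abs(det2(m)))


# Hypothetical substitution matrices on the edges rs and st (rows sum to 1).
M_RS = [[0.9, 0.1], [0.2, 0.8]]
M_ST = [[0.7, 0.3], [0.4, 0.6]]

# Substitution probabilities between r and t along the path r, s, t.
M_RT = matmul(M_RS, M_ST)

# Additivity: the value for the pair {r, t} is the sum over the edges on the path.
ADDITIVE_GAP = neg_log_det(M_RT) - (neg_log_det(M_RS) + neg_log_det(M_ST))
```

The quantity ADDITIVE_GAP is zero up to floating point rounding, which is the additivity property used in the reduction to Buneman's result.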

How far can we generalise such results if the underlying structure is more general than 
a tree? A recent result in this direction is due to [2], where it is shown that under some
mild non-degeneracy conditions the dependency structure of a Markov random field can 
be obtained from sufficiently many independent samples. 

In this paper we present simple models for recombinations and mutations for popula- 
tion pedigrees, and generalise phylogenetic identifiability results for them. One difference 
between reconstructing Markov random fields and reconstructing pedigrees is that for 
pedigrees we have observations only on the extant individuals (e.g., DNA sequences de- 
rived from living individuals). Moreover, in the problem of reconstructing pedigrees, the 
samples of data (e.g., columns in a sequence alignment) are not i.i.d. (independent and 
identically distributed) as a result of recombinations. 

In [14, 15], we studied some purely combinatorial reconstruction problems motivated
by Zareckii's result, for example, the problem of reconstructing a pedigree from the pair- 
wise distances between its extant individuals or from its subpedigrees (pedigrees of subsets 
of the extant population). In [15], we showed that a pedigree cannot in general be recon- 
structed from the collection of its proper subpedigrees. Such a result implies that knowing 
pairwise distances between extant vertices is in general not enough to reconstruct
pedigrees.

In [12], we considered models for sequences evolving on pedigrees, and showed that 
for certain simple Markov models, pedigrees are not identifiable from the distribution 
of observed states at extant vertices. We did construct examples of processes for which 
pedigrees could be proved to be identifiable, but the processes lacked the Markovian 
property, which informally states that the state observed at a vertex depends only on the 
states of its parents. Moreover, it seems that pedigrees that are difficult to reconstruct in 
a purely combinatorial framework (e.g., from pairwise distances between extant vertices or 
from subpedigrees) are also likely to be difficult to reconstruct in a stochastic framework. 
For example, if a pedigree cannot be reconstructed from its proper subpedigrees, then 
the marginal distributions of extant sequences on proper subsets of the extant population 
might be insufficient to uniquely recover the pedigree. On the other hand, negative results 
in a combinatorial setting may not imply non-identifiability in a stochastic framework. It 
is therefore important to study combinatorial reconstruction problems (e.g., classification 
of pedigrees that may be difficult to reconstruct combinatorially), stochastic identifiability
problems for idealised Markov models of recombination and mutation, and relationships 
between these problems. 

Reconstruction problems of purely combinatorial nature are well known to combinatorialists,
the foremost among such problems being the vertex reconstruction conjecture [18].
The conjecture states that all simple undirected unlabelled graphs can be constructed from 
their collection of unlabelled induced subgraphs. Combinatorial reconstruction problems 
have also been studied in phylogenetics, for example, problems of reconstructing phylo- 
genetic trees from subtrees pp. 

Steel and Chang proved their phylogenetic identifiability results in two parts: compu- 
tation of the additive 'log determinant' function on pairs of leaves from the joint proba- 
bility distribution on leaf states, and the combinatorial problem of reconstructing a tree 
from the additive function, which had been solved by Buneman and Zareckii. Similarly
the problem of reconstructing pedigrees under a recombination-mutation model may be 
solved in two parts: in the first part, we would like to reduce the identifiability question 
to an appropriate combinatorial reconstruction problem, and then in the second part we 
would like to show that the combinatorial reconstruction problem has a unique solution. 
To ensure the uniqueness of reconstruction, we will have to compute sufficiently strong 
combinatorial invariants of the pedigree from the joint distribution of extant sequences. 

Although we noted that distances between extant vertices in a pedigree are not suffi- 
cient to reconstruct a pedigree, we sketch a heuristic argument given in [17] that shows 
how the distances between extant vertices in a discrete generation pedigree may be ob- 
tained from sequence data. Suppose a and b are two extant individuals in a pedigree. 
Suppose further that in the pedigree there are n := n(k : a, b) pairs of paths such that 
one of the paths in each pair ends on a and the other ends on b, and the two paths in each 
pair start at a common ancestor of a and b in the k-th generation, and the two paths in 
a pair do not share any other vertex. Now if sufficiently many short recombination-free 
homologous segments of DNA of a and b are compared, then we would expect about 
$n/2^{2k}$ of them to have a common ancestor in the k-th generation. Thus it may be possible
to estimate n(k : a, b). Such calculations are theoretically possible for small values of k
assuming the population is large and the sequences are long. The authors of [17] then tried to use the
numbers n(k : a, b) for all pairs {a, b} for all k to construct the pedigree. They compu- 
tationally found many pairs of non-isomorphic pedigrees that have the same number of 
pairs of paths of each length. One such example is the pair of pedigrees shown in Figure 5,
which was also mentioned in [14].

But a more detailed analysis of sequence similarities (between multiple sequences, if 
required) under simple recombination and mutation models may give us more informa- 
tion than just pairwise distances between living individuals. In the main theorem of 
this paper (Theorem 5.13), we show that the joint distribution on extant sequences determines
a class of combinatorial invariants (e.g., certain types of subgraph sequences)
that supersedes pairwise distances between extant sequences and subtrees (genealogical 
trees) in a pedigree. We then show that pedigrees, such as the ones in Figure 5 that
earlier seemed difficult to distinguish due to their combinatorial similarities (including 
the non-reconstructible pedigrees constructed in [15]) are distinguished by the class of
invariants. 

This paper is organised as follows. In Section 2 we define pedigrees, alignments and
subgraph sequences. In Section 3 we give a schematic description of the recombination
process, and formalise three models for sequences: Model R (a model in which there are 
recombinations but no mutations), Model RM (a model in which there are recombinations 
and mutations), and Model M (a model of mutations for sequences evolving on trees). 
We then formulate identifiability problems for these models. In Section 4 we analyse
pedigrees with two generations under Model R. In Section 5 we prove the main theorem
(Theorem 5.13) and demonstrate its applications. In the last section, we discuss a few open
questions. A section on nomenclature follows the references where all symbols appearing 
in the paper and the number of the page on which they appear first are listed. 

2 Pedigrees, alignments and subgraph sequences 

We use the following notation for number systems and their subsets: $\mathbb{Z}$ for the set of
integers, $\mathbb{Z}^+$ for the set of positive integers, $\mathbb{N}$ for natural numbers, $[m]$ for the set
$\{1, 2, \ldots, m\}$. Depending on the context, we write $[a, b]$ for the set of integers
$\{a, a + 1, \ldots, b\}$ or real numbers $a \le x \le b$ (and similarly $(a, b)$, $[a, b)$ and $(a, b]$
for open or half-open intervals in integers and reals). The set of all $k$-tuples of elements
of a set $S$ is written as $S^k := \{(s_1, s_2, \ldots, s_k) \mid s_i \in S, i \in [k]\}$. The set of all
functions from $X$ to $S$ is written as $S^X := \{f : X \to S\}$.

Next we introduce some graph theoretic notation. The vertex set and the edge set of
a graph $G$ are denoted by $V(G)$ and $E(G)$, respectively, and their cardinalities by $v(G)$
and $e(G)$, respectively. The in-degree and the out-degree of a vertex $u$ in a directed graph
are denoted by $d^-(u)$ and $d^+(u)$, respectively. The degree of a vertex $u$ in an undirected
graph (or the total degree in a directed graph) is denoted by $d(u)$. An arc from $u$ to $v$
in a directed graph and also an edge between $u$ and $v$ in an undirected graph is written
as $uv$, and it will be understood from the context whether $uv$ is meant to be a (directed)
arc or an (undirected) edge. When any two objects $G_1$ and $G_2$ are isomorphic, we write
$G_1 \cong G_2$. The isomorphism class of an object $G$ is written as $\|G\|$. For a collection $\mathcal{Q}$
of labelled objects, we write $\|\mathcal{Q}\|$ for the set of isomorphism classes of objects in $\mathcal{Q}$. Let
$G$ and $H$ be two directed or undirected graphs. We write $G \le H$ (or $H \ge G$) if $G$ is
isomorphic to a subgraph of $H$, and this notation may be used when $G$ or $H$ is unlabelled
(i.e., they are just isomorphism classes). We write $G \subseteq H$ (or $H \supseteq G$) when a labelled
graph $G$ is a subgraph (or a supergraph) of a labelled graph $H$.

Definition 2.1 (General pedigrees). A general pedigree $P(X, Y, U, E)$ of a set $X$ is a
directed acyclic graph on a vertex set $U \supseteq X \cup Y$ and a set of arcs $E$ such that each vertex
has in-degree 0 or 2. The set $X$ is the set of vertices with out-degree 0. The set $Y$ is the
set of vertices with in-degree 0. The vertices in $X$ are called the extant vertices (or the
extant individuals in the population). The vertices in $Y$ are called the founder vertices
(or the founders of the population). The order of the pedigree is $|X|$. The depth of a
pedigree is the length of (i.e., the number of arcs in) a longest path in the pedigree. Two
pedigrees $P(X, Y, U, E)$ and $Q(X, Z, V, F)$ are said to be isomorphic if there is a one-one
map $\pi : U \to V$ such that $uv$ is an arc in $P$ if and only if $\pi(u)\pi(v)$ is an arc in $Q$, and
$\pi(x) = x$ for all $x \in X$. We denote the natural partial order on $U$ by $\le$, i.e., $v \le u$ if
there is a directed path from $u$ to $v$ (or $u = v$).

We define isomorphism only between pedigrees of the same set of extant individuals 
and require that it fixes all extant vertices because (informally speaking) we would like to 
treat extant vertices to be labelled and other vertices to be unlabelled. Throughout this 
paper, we will assume that all pedigrees have X as their set of extant vertices. 
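
The conditions in the definition of a general pedigree are easy to check mechanically. The following Python sketch is only an illustration under the assumption that a pedigree is given as a list of (parent, child) arcs; the function name and the small example pedigree are hypothetical. It verifies the in-degree condition and acyclicity, and returns the extant vertices, the founder vertices and the depth.

```python
def pedigree_summary(arcs):
    """Check the general-pedigree conditions for a list of (parent, child) arcs.

    Returns (extant vertices, founder vertices, depth). Raises ValueError if
    some vertex has in-degree other than 0 or 2, or if the graph has a cycle.
    """
    parents, children, vertices = {}, {}, set()
    for u, v in arcs:
        vertices.update((u, v))
        parents.setdefault(v, set()).add(u)
        children.setdefault(u, set()).add(v)
    for v in vertices:
        if len(parents.get(v, ())) not in (0, 2):
            raise ValueError(f"vertex {v!r} must have in-degree 0 or 2")
    extant = {v for v in vertices if v not in children}    # out-degree 0
    founders = {v for v in vertices if v not in parents}   # in-degree 0

    longest = {}  # vertex -> length of a longest path starting there

    def longest_from(v):
        if v in longest:
            if longest[v] is None:  # v is still on the current search path
                raise ValueError("the graph contains a directed cycle")
            return longest[v]
        longest[v] = None
        best = 0
        for c in children.get(v, ()):
            best = max(best, 1 + longest_from(c))
        longest[v] = best
        return best

    depth = max(longest_from(v) for v in vertices)
    return extant, founders, depth


# Hypothetical pedigree: founders d, e, f; a and b in the middle generation;
# extant vertices 1 and 2, each with parents a and b.
ARCS = [("d", "a"), ("e", "a"), ("e", "b"), ("f", "b"),
        ("a", "1"), ("b", "1"), ("a", "2"), ("b", "2")]
EXTANT, FOUNDERS, DEPTH = pedigree_summary(ARCS)
```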

Definition 2.2 (Diploid pedigrees). Let P(X,Y,U, E) be a pedigree. Suppose that U 
can be partitioned into unordered pairs of vertices such that the following conditions hold: 
the extant vertices are paired with extant vertices and the founder vertices are paired with 
founder vertices; two non-extant vertices v and w are paired if and only if there are arcs 
vu and wu in E; no two paired vertices have a common parent. Then P together with 
one such pairing is called a diploid pedigree. 

Thus any general pedigree is a haploid pedigree; a diploid pedigree is a haploid pedigree 
with a pairing of its vertices (although not all pedigrees admit such a pairing). The pairing 
in a diploid pedigree is completely determined by the pairing of its extant vertices since 
all other pairs are determined by the definition. An advantage of a purely combinatorial 
definition of a diploid pedigree is that we can now assign just one sequence to each vertex. 
From this point of view, haploid pedigrees are pedigrees of sequences, not of individuals. 
Pedigrees of individuals are diploid pedigrees obtained by pairing sequences in a haploid 
pedigree. Figure 1 illustrates this point of view. When vertices (sequences) $A_i$ and $B_i$
are paired, $A_j$ and $B_j$ must be paired, since they are parents of $A_i$; similarly, $A_k$ and $B_k$
must be paired.

Definition 2.3 (Sequence alignments and characters). Let $\Sigma$ be a finite set (called
an alphabet) and $U$ a finite set. A character on $U$ is a map $C : U \to \Sigma$. For $L \in \mathbb{N}$,
an alignment of length $L$ on $U$ is a map $A : U \to \Sigma^L$. Equivalently, an alignment is an
$L$-tuple $(C_1, C_2, \ldots, C_L)$ of characters on $U$, or a two dimensional array of symbols from
$\Sigma$, with $|U|$ rows and $L$ columns. The rows of the array are called sequences, and are
written as $A(i)$, $i = 1$ to $|U|$. Individual entries in the array are written as $A(i, j)$, where
$i = 1$ to $|U|$ and $j = 1$ to $L$. The columns of an alignment are called sites.

Figure 1: Pairing vertices (sequences) of a haploid pedigree (left panel: a haploid pedigree; right panel: a diploid pedigree)

Usually $U$ will be the set of vertices of a pedigree, and we will be interested in alignments
restricted to the set $X$ of extant vertices of the pedigree. The space of characters
on $X$ is $\Sigma^X$, which is also referred to as the space of site patterns. The set of alignments
of length $L$ on a set $X$ is $(\Sigma^X)^L$, which we write as $\Sigma^{XL}$.

Let $P(X, Y, U, E)$ be a pedigree and let $A \in \Sigma^{UL}$ be an alignment on $P$. If the
sequences in $A$ have evolved under some process of recombination or mutation, then
(regardless of the details of the model of recombination or mutation) we may suppose
that each site in each sequence is inherited only from one of the two parent sequences.
Therefore, for each $u \in U$ and each $j \in [L]$, there is a unique directed path $P_{uj}$ from some
founder vertex $y_{uj}$ to $u$ that defines the genetic ancestry of $A(u, j)$. Thus each site $j$
has associated with it a spanning forest $G_j := \bigcup_{u \in U} P_{uj}$, and each alignment of length $L$
has an underlying (usually unknown) sequence $(G_j, j = 1, 2, \ldots, L)$ of spanning forests.
Similarly, each site $j$ has associated with it a directed subforest defined by
$T_j := \bigcup_{x \in X} P_{xj}$, and we have a directed subforest sequence $(T_j, j = 1, 2, \ldots, L)$
of the alignment. Here we define these notions purely graph theoretically (without
reference to alignments or models).

Definition 2.4. Let $X$ be a finite set. A directed X-forest $T$ is a directed forest that
satisfies the following conditions: the set of vertices with out-degree 0 is $X$, and is called
the leaf set of $T$; each component has a single vertex of in-degree 0, called the root vertex of
the component, and all other vertices have in-degree 1; all arcs are directed away from the
root vertices. An undirected X-forest is an unrooted forest with leaf set $X$. Suppose $T$ is a
directed X-forest. It induces a natural partition of $X$ into maximal clusters $X_1, X_2, \ldots, X_k$
such that vertices in cluster $X_i$ have a unique most recent common ancestor (MRCA) $u_i$
in $T$. We construct a subgraph of $T$ induced by $u_i$, $i \in [k]$, and their descendants in
$T$, and then replace its (directed) arcs by (undirected) edges. The resulting undirected
unrooted graph is called the undirected X-forest of $T$, written as $T_u(T)$. It is the maximal
undirected X-forest contained in $T$.

In the above definition, the term X-forest is meant to be analogous to the term X-tree 
that is commonly used in phylogenetics [12]. In this paper we will adapt some of the phy- 
logenetic identifiability results for undirected X-forests that appear as undirected graphs 
underlying subgraphs of pedigrees. Therefore, in the following definition, we specialise 
the terms for pedigrees. 

Definition 2.5. Let P be a pedigree. A spanning forest of P is a spanning subgraph G 
of P such that the in-degree of each vertex in G is 1 unless it is a founder vertex in P. A 
directed X-forest of P is a subgraph T of P such that T is a directed X-forest and the 
root vertex of each component of T is a founder vertex in P. Each spanning forest G in 
$P$ contains a unique directed X-forest of $P$, and we denote it by $T_d(G)$. An undirected
X-forest in $P$ is the unique undirected X-forest in any directed X-forest in $P$. Each
spanning forest $G$ in $P$ contains a unique undirected X-forest of $P$, and we denote it by
$T_u(G)$.

Note that we use the term spanning forest in a specific sense: a spanning forest is not 
any spanning forest in the graph theoretic sense. We illustrate these terms in Figure 2,
which shows a pedigree and a spanning forest $G$ with $E(G) = \{da, a1, a2, eb, b3, fc\}$
(shown by bold arcs). In this example, the unique directed X-forest $T_d(G)$ in $G$ has the
arc set $\{da, a1, a2, eb, b3\}$; its clusters are $X_1 = \{1, 2\}$ and $X_2 = \{3\}$, and the root vertices
of its components are $d$ and $e$. The unique undirected X-forest $T_u(G)$ in $G$ consists of
vertex set $\{1, 2, 3, a\}$ and edge set $\{a1, a2\}$. Note that vertex 3 is isolated in $T_u(G)$ since
it is the MRCA of its cluster, but we need it in our analysis.

Figure 2: A pedigree and a spanning forest, which is shown by bold arcs
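
The example of Figure 2 can be reproduced computationally. The Python sketch below takes the arc set of the spanning forest G given in the text and extracts its directed X-forest, the clusters, and the undirected X-forest. The representation of arcs as (parent, child) pairs and the function names are assumptions made for this illustration.

```python
def directed_x_forest(forest_arcs, extant):
    """Arcs of a spanning forest on the paths from the roots to extant vertices."""
    parent = {child: par for par, child in forest_arcs}  # in-degree at most 1
    keep = set()
    for x in extant:
        v = x
        while v in parent:
            keep.add((parent[v], v))
            v = parent[v]
    return keep


def undirected_x_forest(td_arcs, extant):
    """Clusters, vertex set and edge set of the undirected X-forest."""
    parent = {child: par for par, child in td_arcs}

    def upward_path(v):
        path = [v]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path  # from v up to the root of its component

    by_root = {}
    for x in extant:
        by_root.setdefault(upward_path(x)[-1], set()).add(x)
    clusters = sorted(by_root.values(), key=sorted)

    vertices, edges = set(), set()
    for leaves in clusters:
        paths = [upward_path(x) for x in sorted(leaves)]
        # MRCA: lowest vertex lying on the upward paths of all leaves in the cluster.
        mrca = next(v for v in paths[0] if all(v in p for p in paths))
        for p in paths:
            below = p[:p.index(mrca) + 1]
            vertices.update(below)
            edges.update(frozenset(e) for e in zip(below[1:], below[:-1]))
    return clusters, vertices, edges


# Spanning forest G of Figure 2, with arcs written as (parent, child).
G = {("d", "a"), ("a", "1"), ("a", "2"), ("e", "b"), ("b", "3"), ("f", "c")}
X = ["1", "2", "3"]
TD = directed_x_forest(G, X)
CLUSTERS, TU_VERTICES, TU_EDGES = undirected_x_forest(TD, X)
```

The arc fc is discarded because c is not ancestral to any extant vertex, and vertex 3 appears in the undirected X-forest as an isolated vertex, in agreement with the description of the example.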

Definition 2.6. Two (directed or undirected) X-forests $T$ and $T'$ are said to be isomorphic
(written $T \cong T'$) if there is a graph theoretic isomorphism $\pi$ from $T$ to $T'$ such that
$\pi(x) = x$ for all $x$ in $X$. The isomorphism class of a directed or an undirected X-forest
$T$, denoted by $\|T\|$, is the set of all X-forests $T'$ that are isomorphic to $T$.
The set of all directed X-forests of $P$ is denoted by $\mathcal{T}_P$. The set of isomorphism classes
of directed X-forests of $P$ (or the set of distinct directed X-forests in $P$) is denoted by
$\|\mathcal{T}_P\|$. The set of spanning forests of a pedigree $P$ is denoted by $\mathcal{Q}_P$. The set of undirected
X-forests of $P$ is denoted by $\mathcal{U}_P$. The set of isomorphism classes of undirected X-forests
of $P$ (or the set of distinct undirected X-forests in $P$) is denoted by $\|\mathcal{U}_P\|$.

Proposition 2.7. Each spanning forest of a pedigree $P$ has $e(P)/2$ arcs and contains
exactly one directed X-forest. There are $2^{e(P)/2}$ spanning forests. Each directed X-forest
$T$ is contained in $2^{(e(P) - 2e(T))/2}$ spanning forests.

Proof. Each spanning forest of P is obtained by selecting one of the two arcs that point 
to each non-founder vertex. The unique directed X-forest T in a spanning forest G is the 
subgraph of G spanned by vertices in G that are ancestral in G to the extant vertices. 
Finally, $e(P) - 2e(T)$ is the number of arcs not pointing to vertices in $T$; therefore,
$(e(P) - 2e(T))/2$ is the number of non-founder vertices outside $T$, at each of which we
can choose one of the two incoming arcs to construct a spanning forest containing $T$. □
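
The counts in the proposition can be verified by brute force on a small example. The following Python sketch enumerates all spanning forests of a hypothetical pedigree by choosing one incoming arc at each non-founder vertex, and groups them by the directed X-forest they contain. The pedigree and the function names are assumptions made for this illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical pedigree with e(P) = 8 arcs, written as (parent, child).
P_ARCS = [("d", "a"), ("e", "a"), ("e", "b"), ("f", "b"),
          ("a", "1"), ("b", "1"), ("a", "2"), ("b", "2")]
EXTANT = ["1", "2"]


def spanning_forests(arcs):
    """Yield every choice of one incoming arc at each non-founder vertex."""
    parents = {}
    for u, v in arcs:
        parents.setdefault(v, []).append(u)
    children = sorted(parents)
    for choice in product(*(parents[c] for c in children)):
        yield frozenset(zip(choice, children))


def directed_x_forest(forest, extant):
    """Arcs of the forest on the paths from the roots to the extant vertices."""
    parent = {child: par for par, child in forest}
    keep = set()
    for x in extant:
        v = x
        while v in parent:
            keep.add((parent[v], v))
            v = parent[v]
    return frozenset(keep)


FORESTS = list(spanning_forests(P_ARCS))
# Number of spanning forests containing each directed X-forest.
CONTAINING = Counter(directed_x_forest(g, EXTANT) for g in FORESTS)
```

For this pedigree there are 16 spanning forests with 4 arcs each; a directed X-forest with 3 arcs lies in 2 of them and one with 4 arcs lies in exactly 1, as the formula predicts.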

In any model of recombination, the number of recombination events is determined by 
the spanning forest sequence underlying an alignment, but we define it for all spanning 
forest sequences without reference to alignments or models. 

Definition 2.8. Let $P$ be a pedigree and let $\mathbf{G} := (G_i, i = 1, 2, \ldots, L)$ be a spanning
forest sequence in $P$. If $vu$ and $wu$ are distinct arcs in $P$ such that $vu$ is in $G_i$ and $wu$ is
in $G_{i+1}$, then we say that a recombination has occurred at site $i$ at vertex $u$ or that $i$ is a
recombining site. If $vu$ is an arc in both $G_i$ and $G_{i+1}$, then we say that there is no recombination
at $u$ at site $i$. We define the number of recombinations in $\mathbf{G}$ to be

$$r(\mathbf{G}) := \sum_{i=1}^{L-1} |E(G_{i+1}) \,\triangle\, E(G_i)|/2,$$

where $|E(G_{i+1}) \,\triangle\, E(G_i)|/2$ is the number of recombinations separating $G_i$ and $G_{i+1}$. The
number of points of no recombination is

$$s(\mathbf{G}) := \sum_{i=1}^{L-1} |E(G_{i+1}) \cap E(G_i)|.$$

The directed X-forest sequence of $\mathbf{G}$ is the sequence $\mathbf{T}_d := (T_d(G_i), i = 1, 2, \ldots, L)$, and
the undirected X-forest sequence of $\mathbf{G}$ is the sequence $\mathbf{T}_u := (T_u(G_i), i = 1, 2, \ldots, L)$.
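
The two counts in this definition translate directly into set operations. The Python sketch below is an illustration on a hypothetical spanning forest sequence of length 3 in a pedigree with four non-founder vertices; arcs are written as (parent, child) pairs and the function names are assumptions.

```python
def recombinations(forest_seq):
    """r(G): half the size of the symmetric difference, summed over adjacent sites."""
    return sum(len(set(g_next) ^ set(g)) // 2
               for g, g_next in zip(forest_seq, forest_seq[1:]))


def no_recombination_points(forest_seq):
    """s(G): number of arcs shared by adjacent spanning forests, summed over sites."""
    return sum(len(set(g_next) & set(g))
               for g, g_next in zip(forest_seq, forest_seq[1:]))


# Hypothetical pedigree: a has parents d, e; b has parents e, f;
# the extant vertices 1 and 2 both have parents a and b.
G1 = {("d", "a"), ("e", "b"), ("a", "1"), ("a", "2")}
G2 = {("d", "a"), ("e", "b"), ("a", "1"), ("b", "2")}  # recombination at vertex 2
G3 = {("e", "a"), ("e", "b"), ("b", "1"), ("b", "2")}  # recombinations at a and 1
SEQ = [G1, G2, G3]

R = recombinations(SEQ)           # 1 + 2
S = no_recombination_points(SEQ)  # 3 + 2
```

Since every non-founder vertex either recombines or does not recombine between two adjacent sites, the two counts add up to (L - 1) times the number of non-founder vertices, which gives a simple consistency check.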

3 Models R and RM, and identifiability problems 

We assume that in any reasonable model of sequence evolution, sequences are first as- 
signed to the founder vertices, and then subsequent generations of individuals inherit their 
sequences from their parents' sequences subject to recombinations and mutations. We are 
then interested in the following types of identifiability questions. 






Problem 3.1. Suppose sequences of equal length over a finite alphabet are assigned to 
the founder vertices of a pedigree. The sequences then evolve on the pedigree undergoing 
recombinations and mutations, giving a probability distribution on the space of align- 
ments on the set of extant vertices. Can we determine the pedigree uniquely (i.e., up to 
isomorphism) - with or without the knowledge of the size or the depth of the pedigree 
or the various probability parameters defining the recombination and mutation processes, 
with or without restrictions such as discrete generations or constant population, and so 
on? In the case of diploid pedigrees, we will be given the distribution on alignments on 
the set of extant vertices along with a pairing of extant vertices. 

In this paper, we study the above types of questions under two simple models of 
recombination and mutation. In Model R, we assume that sequences evolve on a pedigree
under a process of recombinations without mutations. In Model RM, we assume that 
sequences evolving on a pedigree undergo recombinations and mutations. For convenience, 
we also formalise the mutation part of Model RM for the spanning forests of pedigrees 
and in general for directed X-forests, and call it Model M. 

In all these models, we assume that first all the founder vertices of a pedigree (or the 
root vertices of a spanning forest or a directed X-forest) are assigned sequences. These 
sequences are independently selected from a uniform distribution on $\Sigma^L$, where $\Sigma$ is a
known finite alphabet. Then the sequences evolve on the pedigree (or a spanning forest 
or a directed X-forest) in a top-down manner, i.e., a vertex is assigned a sequence only 
after its parents have been assigned sequences. 

We begin with a schematic description of the recombination process. Our description 
is largely based on Chapter 12 of [7]. Figure 3 schematically shows the process of gamete
(sperm or egg) formation in eukaryotes. Initially there is a parent cell with one pair of 
homologous non-sex chromosomes. Each chromosome is then duplicated with the identical 
sister chromatids joined together at the centromere, forming a four-strand bundle. Then 
the two duplicated chromosomes exchange material between chiasmata (recombination 
points). In the diagram, there are three recombination events. The first recombination is 
between strands 1 and 3 (counted from top to bottom), the second is between strands 2 
and 4, and the third is between strands 2 and 3. The four chromatids after the exchange 
of material are shown next. Then the cell undergoes two cell divisions to create four 
haploid gametes, each receiving one of the four chromosomes. 

As shown in the diagram, at each recombination point a crossover occurs between 
one strand from the first pair and one strand from the second pair. At each crossover, a 
strand from the first pair and a strand from the second pair are chosen randomly with 
equal probability (independent of other chiasmata). This independence property is known
as the lack of chromatid interference. 

Suppose that the locations of recombinations are modelled as a Poisson point process
along the sequence (or on $[0, \infty)$) with rate $\lambda$, or as a Bernoulli process with probability
$p$ (so that a crossover occurs after a site on a sequence with probability $p$ independently
of other sites or sequences). Since exactly two of the four gametes - one from the first
pair and one from the second pair - inherit any recombination, any given gamete inherits
a recombination with probability 1/2. Therefore, for the sequence of any fixed gamete,
the locations of recombinations are still modelled by a Poisson process, but with rate
$\lambda/2$ (or by a Bernoulli process with probability $p/2$). Hence a model may be formalised
with just two parent sequences instead of four. A Poisson process for the locations of
chiasmata was first proposed in [6]. Based on the above description, we formalise models
R and RM, in which we assume that crossovers in a finite sequence occur according to a
Bernoulli process.







Figure 3: Schematic description of recombination for diploid cells: (A) two homologous
chromosomes in a parent cell; (B) each chromosome is duplicated and a 4-strand bundle
is formed; (C) the sister chromatids of the first chromosome exchange material with the
sister chromatids of the second chromosome between recombination points 1, 2, 3; (D) four
chromatids after the exchange of material; (E) the four strands are inherited by four
gametes



Model R: Consider three sequences $A(i)$ of length $L$ over alphabet $\Sigma$, where $i \in \{u, v, w\}$
and $v$ and $w$ are parents of $u$. The sequence $A(u)$ is obtained by recombining sequences
$A(v)$ and $A(w)$ as follows. Let $X_1, X_2, \ldots$ be a Markov chain on the state space $\{v, w\}$
with transition probabilities $p_{ij} = p$ if $i \neq j$ for $i, j \in \{v, w\}$, and
$\Pr\{X_1 = v\} = \Pr\{X_1 = w\} = 1/2$. Then, for $k = 1, 2, \ldots, L$, $A(u, k) \leftarrow A(i, k)$
if $X_k = i$. Thus $X_{k+1} \neq X_k$ indicates a crossover from one sequence to the other. We
refer to this model as Model R.
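
A minimal Python sketch of Model R for a single child may make the definition concrete. It is an illustration only: sequences are encoded as strings, the parents $v$ and $w$ are encoded as indices 0 and 1, and the function name `recombine` is an assumption of this sketch.

```python
import random

def recombine(parent_v, parent_w, p, rng):
    """Sample a child sequence and the hidden path X_1, ..., X_L (Model R)."""
    assert len(parent_v) == len(parent_w)
    parents = (parent_v, parent_w)
    state = rng.randrange(2)          # X_1 is v (0) or w (1), each with probability 1/2
    child, path = [], []
    for k in range(len(parent_v)):
        if k > 0 and rng.random() < p:
            state = 1 - state         # crossover: X_{k+1} differs from X_k
        path.append(state)
        child.append(parents[state][k])   # A(u, k) <- A(i, k) with i = X_k
    return "".join(child), path

child, path = recombine("AAAAAAAAAA", "CCCCCCCCCC", p=0.2, rng=random.Random(0))
```

With $p = 0$ the child is a copy of one parent; larger $p$ produces a mosaic of the two.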

Model RM: Consider three sequences $A(i)$ of length $L$ over alphabet $\Sigma$, where $i \in \{u, v, w\}$
and $v$ and $w$ are parents of $u$. The sequence $A(u)$ is obtained from the sequences $A(v)$
and $A(w)$ by a process of recombinations and mutations as follows. Let $X_1, X_2, \ldots$ be a
Markov chain on the state space $\{v, w\}$ with transition probabilities $p_{ij} = p$ if $i \neq j$ for
$i, j \in \{v, w\}$, and $\Pr\{X_1 = v\} = \Pr\{X_1 = w\} = 1/2$. Then if $X_k = i$ and $A(i, k) = r$,
then $A(u, k)$ is assigned $r$ with probability $1 - (|\Sigma| - 1)\mu$, and $A(u, k)$ is assigned a state
different from $r$ with probability $(|\Sigma| - 1)\mu$. When a state different from $r$ is assigned to
$A(u, k)$, each state in $\Sigma \setminus \{r\}$ has equal probability $\mu$ of being assigned to $A(u, k)$. We
refer to this process as Model RM.
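
The following Python sketch extends the same idea to Model RM. It is an illustration under stated assumptions: the total mutation probability per site is taken to be $(|\Sigma| - 1)\mu$, so each other state is chosen with probability $\mu$, and the names `inherit_site` and `recombine_mutate` are hypothetical.

```python
import random

def inherit_site(r, alphabet, mu, rng):
    """Copy state r, or replace it by each other state with probability mu."""
    others = [s for s in alphabet if s != r]
    if rng.random() < len(others) * mu:   # total mutation probability (|Sigma| - 1) * mu
        return rng.choice(others)         # uniform over the other states
    return r

def recombine_mutate(parent_v, parent_w, p, mu, alphabet, rng):
    """Sample a child sequence under a sketch of Model RM."""
    parents = (parent_v, parent_w)
    state = rng.randrange(2)              # X_1 uniform on {v, w}
    child = []
    for k in range(len(parent_v)):
        if k > 0 and rng.random() < p:
            state = 1 - state             # crossover between the parents
        child.append(inherit_site(parents[state][k], alphabet, mu, rng))
    return "".join(child)
```

Setting `mu` to 0 recovers the recombination-only behaviour of Model R.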

Model M: This process is defined for spanning forests of a pedigree and directed X-forests.
First each founder or the root vertex in each component of a directed X-forest is assigned
independently and uniformly randomly a state from $\Sigma$. Suppose $j$ is the parent vertex
of $i$ in a spanning forest of a pedigree or in a directed X-forest. Let $A(i)$ and $A(j)$ be
the sequences of $i$ and $j$, respectively, both of equal length $L$ over alphabet $\Sigma$. Then for
each $k \in [L]$, $A(i, k)$ is assigned the same state as $A(j, k)$ with probability $1 - (|\Sigma| - 1)\mu$,
and $A(i, k)$ is assigned a state different from $A(j, k)$ with probability $(|\Sigma| - 1)\mu$. When
a state different from $A(j, k)$ is assigned to $A(i, k)$, each state in $\Sigma \setminus \{A(j, k)\}$ has equal
probability $\mu$ of being assigned to $A(i, k)$. We refer to this process as Model M.

Model M on a directed X-forest $T$ is equivalent to a similarly formulated model on
the undirected X-forest of $T$. We root each component of the undirected X-forest
arbitrarily, and assign to it a state from $\Sigma$ uniformly randomly, independent of the roots
of other components. The state then evolves away from the root in each component. If
a component itself is an isolated vertex, it is simply assigned a state from $\Sigma$ uniformly
randomly. Since the mutation model described here is reversible, the same distribution
on the site patterns is observed on $X$ in the undirected X-forest as in a directed X-forest
for a given $\mu$. Therefore, when we try to construct a tree from the character distribution
on its leaves, we cannot construct the directed X-forest, but we can at best construct
the undirected X-forest underlying it. Therefore, we will consider Model M only on
undirected X-forests.
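
Reversibility here amounts to detailed balance of the per-edge mutation matrix with respect to the uniform root distribution. The Python sketch below checks this numerically for one choice of parameters; the encoding of states as integers and the function names are assumptions of the sketch.

```python
def mutation_matrix(sigma, mu):
    """Per-edge transition matrix of Model M on states 0..sigma-1."""
    stay = 1 - (sigma - 1) * mu
    return [[stay if a == b else mu for b in range(sigma)] for a in range(sigma)]

def is_reversible(matrix, root_dist, tol=1e-12):
    """Detailed balance: pi_a * M[a][b] equals pi_b * M[b][a] for all a, b."""
    n = len(matrix)
    return all(abs(root_dist[a] * matrix[a][b] - root_dist[b] * matrix[b][a]) <= tol
               for a in range(n) for b in range(n))

M = mutation_matrix(4, 0.05)
uniform = [0.25] * 4
```

Because the matrix is symmetric, detailed balance holds for the uniform distribution, which is why the position of the root does not affect the site pattern distribution.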

Thus Model RM may be thought of as a synthesis of Models R and M so that the 
recombination-free segments of sequences evolving on a pedigree may be examined under 
Model M with phylogenetic methods. 

Let $P$ be a pedigree on $X$. For an alignment $A \in \Sigma^{XL}$, we denote by
$\Pr\{A \mid P, RM(p, \mu)\}$ the probability that sequences of length $L$ evolving on the pedigree
$P$ under model $RM(p, \mu)$ give an alignment $A$ on $X$. We use analogous notation when the
model $RM(p, \mu)$ is replaced by the model $R(p)$, or when the pedigree $P$ is replaced by a
directed or an undirected X-forest $T$ and the model under consideration is the mutation
model $M(\mu)$. We denote the various probability spaces by $(\Sigma^{XL} : P, RM(p, \mu))$,
$(\Sigma^{XL} : P, R(p))$, $(\Sigma^{XL} : T, M(\mu))$, $(\Sigma^{X} : T, M(\mu))$, and so on. For pedigrees $P$ and
$Q$, we will write $(\Sigma^{XL} : P, RM(p, \mu)) = (\Sigma^{XL} : Q, RM(p, \mu))$ when
$\Pr\{A \mid P, RM(p, \mu)\} = \Pr\{A \mid Q, RM(p, \mu)\}$ for all $A \in \Sigma^{XL}$, and analogously for
other models.

As in the case of alignments, we treat the spaces of spanning forest sequences, directed
X-forest sequences, and undirected X-forest sequences as probability spaces, and denote
them by $(\mathcal{G}^L : P, R(p))$, $(\mathcal{T}^L : P, R(p))$, and $(\mathcal{U}^L : P, R(p))$, respectively. The spanning
forest sequences, and directed and undirected X-forest sequences (and the corresponding
sequences of isomorphism classes of spanning forests and directed and undirected
X-forests), are defined by only the recombination events; therefore, the probability spaces
are unchanged if $R(p)$ is replaced by $RM(p, \mu)$. For a spanning forest sequence
$G := (G_1, G_2, \ldots, G_L)$, we will write $\Pr\{G \mid P, R(p)\}$ for the probability of $G$ in the
probability space $(\mathcal{G}^L : P, R(p))$. We will use analogous notation for other sequences and
probability spaces. Unless stated otherwise, when we refer to alignments, sequences of
spanning forests, or other objects, we mean objects from the appropriate probability spaces
that are clear from the context.

Definition 3.2. Nonisomorphic pedigrees $P$ and $Q$ in a class C are said to be distinguished
from each other under model $RM(p, \mu)$ if $(\Sigma^{XL} : P, RM(p, \mu)) \neq (\Sigma^{XL} : Q, RM(p, \mu))$
for some $L$ (i.e., for some $L$, there exists $A \in \Sigma^{XL}$ such that $\Pr\{A \mid P, RM(p, \mu)\} \neq
\Pr\{A \mid Q, RM(p, \mu)\}$). A pedigree $P$ in a class C is said to be identifiable under model
$RM(p, \mu)$ if it is distinguished from every other pedigree $Q$ in C (i.e., if there is a pedigree
$Q$ in C such that $(\Sigma^{XL} : P, RM(p, \mu)) = (\Sigma^{XL} : Q, RM(p, \mu))$ for all $L \in \mathbb{Z}_+$, then $Q$ is
isomorphic to $P$). Pedigrees in a class C are said to be identifiable under model $RM(p, \mu)$
if all pairs of pedigrees in C are distinguished from each other under model $RM(p, \mu)$.

Similar terminology will be used for other models and for undirected X-forests. Stronger
notions of identifiability may be defined, and correspondingly stronger variants of
identifiability questions may be asked. The above definitions assume the model parameters
to be fixed. But we may ask if there are nonisomorphic pedigrees $P$ and $Q$ and model
parameters $p, p', \mu, \mu'$ such that $(\Sigma^{XL} : P, RM(p, \mu)) = (\Sigma^{XL} : Q, RM(p', \mu'))$ for all
$L \in \mathbb{Z}_+$. Given the probability distribution a pedigree induces on the space of alignments,
we may ask if the pedigree can be recognised to be in a class C. We assume in all results in
this paper that the model parameters and the class C (typically defined by the size of a
pedigree) are fixed (but possibly unknown).

Remark 3.3. Pedigrees of order 1 are in general not identifiable since for all pedigrees
of order 1, the extant sequence will be uniformly distributed. In fact, for all pedigrees,
each extant sequence will be (marginally) uniformly distributed. But when there is more
than one extant vertex, the joint distribution of extant sequences will contain some
information about the pedigree on which they have evolved, because, for example, some
vertices may have common ancestors, so their sequences will be correlated. We will therefore
consider identifiability questions for pedigrees of order more than 1.

The models R and RM on a pedigree $P$ with $e(P) = 2e$ arcs may be interpreted as
Hidden Markov Models with the set of hidden states $\mathcal{G}_P$ and the set of observed states $\Sigma^X$.
This is illustrated in Figure 4. The initial probability for each hidden state is $1/2^e$. The
probability of an observed state conditional on a given hidden state $G \in \mathcal{G}_P$ can be easily
computed (and actually depends only on $T_u(G)$ and $\mu$). The probability of transition
from state $G_i$ to $G_j$ is given by

$$\Pr\{G_j \mid G_i\} = p^{|E(G_j) \triangle E(G_i)|/2} (1 - p)^{|E(G_j) \cap E(G_i)|}. \tag{3.1}$$
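
The transition probability can be evaluated directly once spanning forests are encoded. The Python sketch below assumes (as an encoding chosen for illustration, not taken from the text) that each of the $e$ non-founder vertices has exactly two parents and that a spanning forest is a tuple recording which parent arc each such vertex keeps; the number of differing choices is then $|E(G_j) \triangle E(G_i)|/2$ and the number of equal choices is $|E(G_j) \cap E(G_i)|$.

```python
from itertools import product

def transition_prob(g_from, g_to, p):
    """Pr{G_j | G_i} for spanning forests encoded as tuples of parent choices."""
    switched = sum(a != b for a, b in zip(g_from, g_to))  # |E(G_j) symdiff E(G_i)| / 2
    kept = len(g_from) - switched                         # |E(G_j) intersect E(G_i)|
    return p ** switched * (1 - p) ** kept

e, p = 3, 0.1
states = list(product((0, 1), repeat=e))      # the 2^e spanning forests
row_sums = [sum(transition_prob(g, h, p) for h in states) for g in states]
```

Each row of the resulting $2^e \times 2^e$ matrix sums to 1, and the matrix is symmetric.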

In most contexts in which HMMs are used, one assumes that the set of hidden states
is known, but here we do not know the set $\mathcal{G}_P$. We also note that the sequence of
X-forests is not a Markov chain. Consider directed X-forests $T_j$ and $T_{j+1}$ at sites $j$ and
$j + 1$, respectively. Suppose a vertex $u$ in $V(T_{j+1}) \setminus V(T_j)$ has parents $v$ and $w$ in the
pedigree. The probability that the arc $vu$ is in $T_{j+1}$ depends on the history before the
$j$-th site. For example, if $vu$ was in $T_{j-1}$ then the probability that it would also be in
$T_{j+1}$ is $(1 - p)^2 + p^2$. But if $vu$ was not in $T_{j-1}$, the probability that it would be in
$T_{j+1}$ is $2p(1 - p)$. Therefore, $T_1, T_2, \ldots$ is not a Markov chain, i.e.,
$\Pr\{T_{j+1} \mid T_j, T_{j-1}, \ldots, T_1, P, R(p)\}$ may not be the same as
$\Pr\{T_{j+1} \mid T_j, T'_{j-1}, \ldots, T'_1, P, R(p)\}$ if the sequences $T_{j-1}, \ldots, T_1$ and
$T'_{j-1}, \ldots, T'_1$ are different. Therefore, we cannot interpret the models as HMMs with
directed or undirected X-forests as hidden states.




Figure 4: An observed sequence $C_1, C_2, \ldots$ of characters at the extant vertices of a pedigree
$P$. Among the intermediate chains, only $G_1, G_2, \ldots$ is a Markov chain.



4 Analysis of some examples under Model R 

In this section, we analyse diploid pedigrees and certain haploid pedigrees that obey many
properties of diploid pedigrees under model R. We show that, with some exceptions and
mild conditions on $p$ and $|\Sigma|$, diploid pedigrees of depth 2 are reconstructible from the
probability distribution on the alignments on extant vertices. We use very basic techniques
such as pairwise comparisons between extant sequences to exploit the correlation between
them to reconstruct their pedigree.

Let $P$ be a pedigree and let $T$ be an undirected X-forest. We define
$n(G > T : P) := |\{G \in \mathcal{G}_P : T_u(G) = T\}|$.

Proposition 4.1. Let $P$ be a pedigree with $e(P) = 2e$ arcs. Let $C \in \Sigma^X$ be any character.
Then the probability that the $k$-th character in an alignment is $C$ is given by

$$\Pr\{C_k = C \mid P, RM(p, \mu)\} = \sum_{T \in \|\mathcal{U}_P\|} \frac{n(G > T : P)}{2^e} \Pr\{C \mid T, M(\mu)\}. \tag{4.1}$$

In particular, it does not depend on $k$.

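Proof. Let $G_1, G_2, \ldots$ be a spanning forest sequence. It is a time-homogeneous Markov
chain on $\mathcal{G}_P$, with transition probabilities given by

$$\Pr\{G_{i+1} = G' \mid G_i = G\} = p^{|E(G') \triangle E(G)|/2} (1 - p)^{|E(G') \cap E(G)|}.$$

It follows that $\Pr\{G_k = G\} = 1/2^e$ for all $k \in \mathbb{Z}_+$. Let $C_1, C_2, \ldots \in \Sigma^X$ be a sequence
of characters. Then, under Model RM,

$$\begin{aligned}
\Pr\{C_k = C \mid P, RM(p, \mu)\}
  &= \sum_{G \in \mathcal{G}_P} \Pr\{C_k = C \mid G_k = G\} \Pr\{G_k = G \mid P, RM(p, \mu)\} \\
  &= \frac{1}{2^e} \sum_{G \in \mathcal{G}_P} \Pr\{C_k = C \mid G_k = G\} \\
  &= \sum_{T \in \|\mathcal{U}_P\|} \frac{n(G > T : P)}{2^e} \Pr\{C \mid T, M(\mu)\}.
\end{aligned} \tag{4.2}$$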

A similar result holds when Model RM is replaced by Model R. □ 
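
The claim that every spanning forest has probability $1/2^e$ at every site can be spelled out as a short induction on $k$. The derivation below is a sketch; it assumes that the first forest is uniform on $\mathcal{G}_P$ and that exactly $\binom{e}{s}$ spanning forests differ from a given one in the parent choice of $s$ of the $e$ non-founder vertices.

```latex
\begin{align*}
\Pr\{G_{k+1} = G'\}
  &= \sum_{G \in \mathcal{G}_P} \Pr\{G_k = G\}\,\Pr\{G_{k+1} = G' \mid G_k = G\} \\
  &= \frac{1}{2^e} \sum_{G \in \mathcal{G}_P}
     p^{|E(G') \triangle E(G)|/2}\,(1-p)^{|E(G') \cap E(G)|} \\
  &= \frac{1}{2^e} \sum_{s=0}^{e} \binom{e}{s}\, p^{s} (1-p)^{e-s}
   = \frac{1}{2^e}.
\end{align*}
```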

The above proposition implies that if two pedigrees have the same number of undirected
X-forests of each type and the same number of arcs, then the character frequencies
in alignments alone are not sufficient to distinguish the two pedigrees. For example,
pedigrees in Figure 5 cannot be easily distinguished.

But it turns out that, under Model R, most diploid pedigrees of depth 2 are easily
distinguished by making pairwise comparisons between extant sequences and computing
the probability that they agree at a site. The above proposition implies that the probability
that two sequences in an alignment agree at a site $k$ does not depend on $k$ in Models
R and RM. Moreover, such a probability can be computed easily under Model R. Any
given undirected X-forest $T$ of a pedigree induces a partition of its leaves so that the
leaves within a component are in the same part. Under Model R, the probability that the
sequences at two leaves $i$ and $j$ are in the same state at a site is 1 if they are in the same
component of $T$. Otherwise, the probability is $1/|\Sigma|$. Under Model RM, the probability
depends on $\mu$ if they are in the same component, and is $1/|\Sigma|$ otherwise.

Figure 5: Nonisomorphic haploid pedigrees $P_1$ and $Q_1$, each with extant vertices 1 and 2

Proposition 4.2. Let $P$ be a diploid pedigree of depth 2. Let $i, j \in X$. Then the subpedigree
of $i$ and $j$ is determined by the probability distribution induced on $\Sigma^{\{i,j\}}$ under Model
R.

Proof. There are four possible ways in which any two vertices $i$ and $j$ are related in a
diploid pedigree, which are shown in Figure 6. For each of them, we give the probability
that the sequences $A_i$ and $A_j$ match at any site $k$. In the following, we set $\delta := \frac{1}{16|\Sigma|}$.




Figure 6: Four ways in which i and j may be related 

Case 1: If $i$ and $j$ have the same parents, then $\Pr\{A(i, k) = A(j, k)\} = 8\delta + (1/2)$.

Case 2: If $i$ and $j$ have distinct pairs of parents but the same grandparents, then
$\Pr\{A(i, k) = A(j, k)\} = 12\delta + (1/4)$.

Case 3: If $i$ and $j$ have distinct pairs of parents but exactly one pair of grandparents in
common, then $\Pr\{A(i, k) = A(j, k)\} = 14\delta + (1/8)$.

Case 4: If $i$ and $j$ have no common parents or grandparents, then
$\Pr\{A(i, k) = A(j, k)\} = 16\delta$.

Therefore, unless $|\Sigma| = 1$, the above cases are distinguished by the marginal joint
distribution on $\Sigma^{\{i,j\}}$ under Model R. □
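
The four match probabilities can be tabulated with exact rational arithmetic. The Python sketch below assumes $\delta = 1/(16|\Sigma|)$, the value for which Case 4 equals $1/|\Sigma|$, and cross-checks each case against the form "probability of a shared grandparent, plus a chance match otherwise"; the function names and the shared-ancestor probabilities 1/2, 1/4, 1/8 and 0 are stated here as assumptions of the sketch.

```python
from fractions import Fraction

def match_probabilities(sigma):
    """Site-match probabilities for Cases 1-4, with delta = 1/(16*|Sigma|)."""
    delta = Fraction(1, 16 * sigma)
    return {
        1: 8 * delta + Fraction(1, 2),
        2: 12 * delta + Fraction(1, 4),
        3: 14 * delta + Fraction(1, 8),
        4: 16 * delta,
    }

def from_common_ancestor(shared, sigma):
    """Match with probability 1 if the sites share a grandparent, else 1/|Sigma|."""
    return shared + (1 - shared) * Fraction(1, sigma)
```

For every alphabet size of at least 2 the four values are distinct, while for $|\Sigma| = 1$ they all equal 1.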



Remark 4.3. The assumption that $P$ is a diploid pedigree is essential. It implies that $i$
and $j$ are not related as in the haploid pedigree $Q_1$ shown in Figure 5. We can verify that
$\Pr\{A(i, k) = A(j, k)\} = 12\delta + (1/4)$ for both $P_1$ and $Q_1$, so they are indistinguishable by
the above method, which was also pointed out as a consequence of Proposition 4.1.

Remark 4.4. The above probabilities do not depend on $p$; therefore, we have a slightly
stronger identifiability statement: if diploid pedigrees $P$ and $Q$ of depth 2 and crossover
probabilities $p$ and $p'$ are such that $(\Sigma^X : P, R(p)) = (\Sigma^X : Q, R(p'))$, then their
subpedigrees of order 2 are correspondingly isomorphic.

Proposition 4.5. When $|\Sigma| > 2$, pedigrees of depth 2 in which no two vertices have
exactly one common parent are identifiable under model R. In particular, when $|\Sigma| > 2$,
diploid pedigrees of depth 2 are identifiable under model R.

Proof. There are only 4 ways in which any two extant vertices $i$ and $j$ are related. They are
illustrated in Figure 6. Each of the possible relationships is recognised by Proposition 4.2.
We denote the 3rd and the 4th types of relationships by $i \sim j$ and $i \nsim j$, respectively.

Suppose that no two extant vertices $i, j$ in a pedigree are related to each other as
$i \sim j$ or $i \nsim j$. Then the pedigree is constructed by adding extant vertices one by one.
On each step we add one extant vertex and join it to previously added extant vertices as
in Figure 6 (i) or (ii), whichever is appropriate. Therefore, we assume that at least two
extant vertices $i, j$ are related as $i \sim j$ or $i \nsim j$.

Let $Z$ be a (nonempty) maximal subset of $X$ such that for any two distinct extant
vertices $i$ and $j$ in $Z$, either $i \sim j$ or $i \nsim j$. Every other extant vertex $k$ not in $Z$ is related
to some vertex in $Z$ as in Figure 6 (i) or (ii). Therefore, once the subpedigree of $Z$ is
constructed, there is only one way to extend it to the whole pedigree.

To construct the subpedigree of $Z$, we first construct an edge labelled graph with edge
set $Z$ in which edges $i$ and $j$ are incident if and only if $i \sim j$. This is a known problem
in graph theory, namely, the problem of constructing an edge labelled graph from its line
graph. It was proved in [19] that there are only 4 pairs $(G_i, H_i)$ of connected nonisomorphic
edge labelled graphs that have the same line graphs. Edge labelled graphs that cannot
be uniquely constructed from their line graphs must contain components isomorphic to
$G_i$ or $H_i$. We refer to (in particular, Chapter 15, Problem 1) for discussion about
reconstructing graphs from their line graphs, in particular, for the complete list of pairs
$(G_i, H_i)$. The first pair is $(K_{1,3}, K_3)$ (with edges of each of them labelled $i, j, k$). Based
on the example $(K_{1,3}, K_3)$, we construct pedigrees shown in Figure 7 in which all pairs
of extant vertices are similarly related.

We distinguish the two pedigrees in Figure 7 by comparing the probabilities
$\Pr\{A(i, s) = A(j, s) = A(k, s) \mid P_2, R(p)\}$ and $\Pr\{A(i, s) = A(j, s) = A(k, s) \mid Q_2, R(p)\}$
for any site $s$.

Figure 7: Pedigrees indistinguishable by site pattern probabilities (panels $P_2$ and $Q_2$, each with extant vertices $i, j, k$)

In $P_2$, there are 512 spanning forests. Among them there are 192 spanning forests in
which two extant vertices have a common grandparent, giving the first term on the RHS
below. In the remaining 320 spanning forests, no two extant vertices have a common
grandparent, which explains the second term on the RHS below. Therefore,

$$\Pr\{A(i, s) = A(j, s) = A(k, s) \mid P_2, R(p)\} = \frac{192}{512|\Sigma|} + \frac{320}{512|\Sigma|^2}.$$

In $Q_2$, there are 512 spanning forests. Among them there are 16 spanning forests in
which $i, j, k$ have a common grandparent (giving the first term on the RHS below), 144
spanning forests in which two extant vertices have a common grandparent at site $s$ (giving
the second term), and 352 spanning forests in which $i, j, k$ have distinct grandparents at
site $s$ (giving the third term). Therefore,

$$\Pr\{A(i, s) = A(j, s) = A(k, s) \mid Q_2, R(p)\} = \frac{16}{512} + \frac{144}{512|\Sigma|} + \frac{352}{512|\Sigma|^2}.$$

Whenever $|\Sigma| > 2$, $\Pr\{A(i, s) = A(j, s) = A(k, s) \mid Q_2, R(p)\} > \Pr\{A(i, s) = A(j, s) =
A(k, s) \mid P_2, R(p)\}$; therefore, $P_2$ and $Q_2$ can be distinguished. The two expressions are
equal when $|\Sigma| = 2$. Similarly, other pedigrees constructed from $(G_i, H_i)$, $i = 2, 3, 4$ are
distinguished when $|\Sigma| > 2$.
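
The comparison of the two expressions is easy to verify with exact arithmetic. The Python sketch below uses only the spanning forest counts quoted above (192 and 320 for $P_2$; 16, 144 and 352 for $Q_2$); the function names are hypothetical.

```python
from fractions import Fraction

def triple_match_p2(sigma):
    """Pr{A(i,s) = A(j,s) = A(k,s)} on P_2 under Model R."""
    return Fraction(192, 512 * sigma) + Fraction(320, 512 * sigma**2)

def triple_match_q2(sigma):
    """Pr{A(i,s) = A(j,s) = A(k,s)} on Q_2 under Model R."""
    return (Fraction(16, 512) + Fraction(144, 512 * sigma)
            + Fraction(352, 512 * sigma**2))
```

The difference between the value for $Q_2$ and the value for $P_2$ simplifies to $(|\Sigma| - 1)(|\Sigma| - 2)/(32|\Sigma|^2)$, which vanishes exactly when $|\Sigma| = 2$ (or 1) and is positive for larger alphabets.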

Since there are only two types of site patterns for three sequences when $|\Sigma| = 2$ (either
the three sequences agree at a site or exactly two of them agree), the two cases cannot be
distinguished by considering other site pattern probabilities.

In diploid pedigrees, no two vertices have exactly one common parent; therefore, when
$|\Sigma| > 2$, they are identifiable under model R. □

We used only site pattern probabilities in the above proofs. But because of recombinations,
consecutive sites in an alignment are not independent. We use the dependence
between sites to eliminate the restriction $|\Sigma| > 2$ when the crossover probability $p$ is
sufficiently small.

Proposition 4.6. When $|\Sigma| = 2$ and $p$ is sufficiently small, haploid pedigrees $P_2$ and $Q_2$
(shown in Figure 7) are distinguished under model $R(p)$.

Proof. We compute the probability that, in an alignment $A$, there are long runs of sites at
which all the three sequences $A_i$, $A_j$ and $A_k$ are equal. In particular, we compute bounds
on $\Pr\{A(i, m) = A(j, m) = A(k, m)\ \forall m \in [1, t + 1]\}$ on the two pedigrees.

For $P_2$, for any fixed $m$, if any two of the three sites $A(i, m)$, $A(j, m)$, $A(k, m)$ are
inherited from the same grandparent, then $\Pr\{A(i, m) = A(j, m) = A(k, m)\} = 1/2$,
and if all of them are inherited from distinct grandparents, then $\Pr\{A(i, m) = A(j, m) =
A(k, m)\}$ is 1/4. Each of the $t + 1$ sites therefore contributes a factor of at most 1/2, so

$$\Pr\{A(i, m) = A(j, m) = A(k, m)\ \forall m \in [1, t + 1] \mid P_2, R(p)\} \leq (1/2)^{t+1}.$$

For $Q_2$, the probability that the first site of all sequences is inherited from a common
grandparent is 1/32. The probability that at each successive site the three sequences
have a common grandparent is $(1 - p)^6 + p^3(1 - p)^3$. Therefore,

$$\Pr\{A(i, m) = A(j, m) = A(k, m)\ \forall m \in [1, t + 1] \mid Q_2, R(p)\} \geq \frac{\left((1 - p)^6 + p^3(1 - p)^3\right)^t}{32}.$$

When $p$ is sufficiently small and $t$ is sufficiently large, the above probability for $Q_2$ is more
than that for $P_2$. □
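
To see how small $p$ and how large $t$ need to be, the two bounds can be compared numerically. The Python sketch below assumes the upper bound for $P_2$ is $(1/2)^{t+1}$ (one factor of at most 1/2 per site) and the lower bound for $Q_2$ is $((1-p)^6 + p^3(1-p)^3)^t/32$; the function names and the search cap `t_max` are choices made for this sketch.

```python
def p2_upper(t):
    """Upper bound on a run of t + 1 matching sites on P_2 (binary alphabet)."""
    return 0.5 ** (t + 1)

def q2_lower(t, p):
    """Lower bound on a run of t + 1 matching sites on Q_2."""
    per_site = (1 - p) ** 6 + p ** 3 * (1 - p) ** 3
    return per_site ** t / 32

def smallest_t(p, t_max=10_000):
    """Smallest t at which the Q_2 lower bound exceeds the P_2 upper bound."""
    for t in range(1, t_max + 1):
        if q2_lower(t, p) > p2_upper(t):
            return t
    return None   # no such t up to t_max (e.g. when the per-site factor is at most 1/2)
```

A crossing point exists only when $(1 - p)^6 + p^3(1 - p)^3 > 1/2$, which is one way to read the requirement that $p$ be sufficiently small.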

We do not analyse other examples of pedigrees based on graphs that are not reconstructible
from their line graphs (graphs $G_i, H_i$, $i = 2, 3, 4$ mentioned in Proposition 4.5),
but they may be analysed similarly.



5 Reconstructing pedigrees under Model RM 

In this section we develop ideas from Section 4 (especially Proposition 4.6) in much more
generality. Earlier we observed that parts of alignments that are free of recombination may
be analysed with phylogenetic methods. Since we do not know where the recombinations
have occurred in an alignment, we choose long segments of carefully chosen alignments (or
sets of alignments) and show that they have higher probability of having evolved on one
particular X-forest (or a sequence of X-forests) than any other X-forest (or a sequence of
X-forests). For example, the method of Proposition 4.6 works because $Q_2$ contains a tree
in which $i, j, k$ have a common ancestor and $P_2$ does not contain such a tree. Therefore, a
sufficiently long sequence of characters in which sequences $A(i)$, $A(j)$ and $A(k)$ are in the
same state will more likely have evolved on a pedigree such as $Q_2$ than on a pedigree that
does not contain such a tree. This argument may be generalised to count the number of
X-forests of each type from the distribution of extant sequences. Such a generalisation
requires identifiability results for phylogenetic trees, which we state in a form suitable for
our application.

5.1 Identifiability and consistency results for X-forests

In this section, we state known results on identifiability and statistical consistency of
maximum likelihood reconstruction of phylogenetic trees. We need to adapt them slightly
since the X-forests in a pedigree differ from phylogenetic trees in three respects: they
may have vertices of degree 2, they may be unresolved (i.e., they may have vertices of
degree more than 3), and they may be disconnected (two extant vertices may not have
a common ancestor in a given directed X-forest in a pedigree). We address them in the
following identifiability result, which was originally proved for phylogenetic trees in [10]
in the $|\Sigma| = 2$ case. Identifiability and statistical consistency of maximum likelihood
reconstruction of phylogenetic trees were independently proved in full generality for all
$|\Sigma| > 2$ in [131 g .

Theorem 5.1. For all $\mu \in (0, 1/|\Sigma|)$ and any two undirected X-forests $T_1$ and $T_2$ with
bounded number of edges, if $\Pr\{C \mid T_1, M(\mu)\} = \Pr\{C \mid T_2, M(\mu)\}$ for all characters
$C \in \Sigma^X$, then $T_1 \cong T_2$.

Proof. The result follows from the analogous results in [131 0] for phylogenetic X-trees,
but we have to clarify three issues: unlike the phylogenetic X-trees, the X-forests as
defined in this paper may be disconnected, unresolved, and may have vertices of degree
2.

Connectivity: Any two extant vertices $i$ and $j$ are in different components of $T_1$
and $T_2$ if and only if $\Pr\{C(i) = a \mid C(j) = b\} = 1/|\Sigma|$ for all $a, b \in \Sigma$, where $C(i)$ is
the state at the extant vertex $i$ in a character $C$. If $i$ and $j$ are in the same component,
then $\Pr\{C(i) = a \mid C(j) = b\}$ cannot be arbitrarily close to $1/|\Sigma|$ if the number of edges
(and hence the distance between $i$ and $j$) is bounded. Therefore, we can consider the
identifiability question for each component separately.

Unresolved X-forests: Given an unresolved phylogenetic tree, there are resolved
phylogenetic trees with site pattern probabilities arbitrarily close to the site pattern
probabilities for the unresolved tree. Therefore, even though unresolved phylogenetic X-trees
are identifiable, statistical consistency of maximum likelihood methods requires that the
substitution probabilities on the edges of a phylogenetic tree are bounded below by a
positive real number. In our model, Corollary 5.3 below is possible because the substitution
probability $\mu$ is fixed on each edge.

Vertices of degree 2: Let $u$ and $v$ be any two vertices of an X-forest. Suppose that $u$
and $v$ are of degree 1 or more than 2. Suppose all internal vertices on the path between $u$
and $v$ have degree 2. Since $\mu$ is fixed for all arcs, the distance between $u$ and $v$ on a tree is
determined by the substitution probability on the $uv$ path. The substitution probability
on the $uv$ path is determined by the distribution on the space of characters. □

Remark 5.2. In the above result, if $T_1$ and $T_2$ were directed X-forests, then we would be
able to conclude that $T_u(T_1) = T_u(T_2)$.

Let $N := |\Sigma|^{|X|}$. Suppose that $\Sigma^X := \{C_i, i \in [N]\}$. We associate with each undirected
X-forest $T$ a vector $\mathbf{p}(T, \mu) := (p_1, p_2, \ldots, p_N)$ in $\mathbb{R}^N$, where $p_i := \Pr\{C_i \mid T, M(\mu)\}$.
Therefore, the condition $\Pr\{C \mid T_1, M(\mu)\} = \Pr\{C \mid T_2, M(\mu)\}$ for all characters $C \in \Sigma^X$
may be equivalently written as $\mathbf{p}(T_1, \mu) = \mathbf{p}(T_2, \mu)$.

Given $r \in \mathbb{R}_+$ and a point $s \in \mathbb{R}^N$, let the open ball of radius $r$ centred at $s$ be denoted
by $\rho(s, r)$. Here the radius may be taken to be in the 1-norm (i.e., the distance between
points $x := (x_1, x_2, \ldots, x_N)$ and $y := (y_1, y_2, \ldots, y_N)$ is defined by
$d(x, y) = \sum_{i=1}^{N} |x_i - y_i|$). Let $A$ be an alignment on $X$. We define a vector
$\mathbf{f}(A) := (f_1, f_2, \ldots, f_N)$, where $f_i$ are the fractional site pattern frequencies, i.e., $f_i$ is
the fraction of columns of $A$ of type $C_i$. Then the above identifiability result implies a
statistical consistency result for maximum likelihood. It informally says that as the length
of a random alignment $A$ goes to infinity, we expect $\mathbf{f}(A)$ to be arbitrarily close to
$\mathbf{p}(T, \mu)$ with probability approaching 1 if $T$ is the true X-forest, and that the probability
that $\mathbf{f}(A)$ is arbitrarily close to $\mathbf{p}(T, \mu)$ approaches 0 if $T$ is not the true X-forest.
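
The vector of fractional site pattern frequencies and the 1-norm distance are straightforward to compute. The following Python sketch is an illustration only; it assumes an alignment is given as a dictionary from extant vertices to equal-length strings, and it enumerates the $|\Sigma|^{|X|}$ characters in lexicographic order (an ordering chosen for this sketch).

```python
from collections import Counter
from itertools import product

def site_pattern_frequencies(alignment, alphabet):
    """Fraction of columns of each type; alignment maps extant vertices to sequences."""
    taxa = sorted(alignment)
    length = len(alignment[taxa[0]])
    counts = Counter(tuple(alignment[x][k] for x in taxa) for k in range(length))
    patterns = list(product(alphabet, repeat=len(taxa)))   # all |Sigma|^|X| characters
    return [counts[c] / length for c in patterns]

def l1_distance(x, y):
    """1-norm distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(x, y))

f = site_pattern_frequencies({"i": "0011", "j": "0101"}, "01")
```

In this toy alignment each of the four binary site patterns occurs once, so every entry of `f` is 1/4.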

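Corollary 5.3. For all $r_0 \in \mathbb{R}_+$, $\epsilon \in (0, 1)$ and $\mu \in (0, 1/|\Sigma|)$, there exists
$L := L(r_0, \epsilon, \mu) \in \mathbb{N}$ such that for any two undirected X-forests $T$ and $T'$ such that
$T \ncong T'$, and an alignment $A \in \Sigma^{XL}$,

$$\Pr\{\mathbf{f}(A) \in \rho(\mathbf{p}(T, \mu), r_0) \mid T, M(\mu)\} > 1 - \epsilon$$

and

$$\Pr\{\mathbf{f}(A) \in \rho(\mathbf{p}(T, \mu), r_0) \mid T', M(\mu)\} < \epsilon.$$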

We give bounds on the above probabilities in terms of L, which we prove using Bern- 
stein's inequality. 

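Lemma 5.4 (Bernstein's inequality). Let $X, X_1, X_2, \ldots$ be i.i.d. Bernoulli random variables
with $\Pr\{X = 1\} = p$. Then for all $r > 0$ and $n \in \mathbb{Z}_+$,

$$\Pr\left\{\frac{\sum_{i=1}^{n} X_i}{n} - p > r\right\} \leq \exp\left\{\frac{-n r^2}{2p(1 - p) + 2r/3}\right\},$$

and (equivalently)

$$\Pr\left\{\frac{\sum_{i=1}^{n} X_i}{n} - p < -r\right\} \leq \exp\left\{\frac{-n r^2}{2p(1 - p) + 2r/3}\right\}.$$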
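
As a sanity check, the Bernstein bound can be compared with the exact binomial tail for small $n$. The Python sketch below assumes the bound has the form $\exp\{-nr^2/(2p(1-p) + 2r/3)\}$; the function names are hypothetical.

```python
from math import comb, exp

def bernstein_bound(n, p, r):
    """Bernstein upper bound on Pr{ (X_1 + ... + X_n)/n - p > r }."""
    return exp(-n * r * r / (2 * p * (1 - p) + 2 * r / 3))

def exact_upper_tail(n, p, r):
    """Exact Pr{ (X_1 + ... + X_n)/n - p > r } for i.i.d. Bernoulli(p) variables."""
    return sum(comb(n, s) * p**s * (1 - p)**(n - s)
               for s in range(n + 1) if s / n - p > r)
```

The bound is not tight for small $n$, but it decays exponentially in $n$, which is all that the consistency argument needs.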

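Lemma 5.5. Let $r_0 \in \mathbb{R}_+$ and $\mu \in (0, 1/|\Sigma|)$. Let $A \in \Sigma^{XL}$. Let $T$ be an undirected
X-forest. Then

$$\Pr\{\mathbf{f}(A) \notin \rho(\mathbf{p}(T, \mu), r_0) \mid T, M(\mu)\} \leq 2|\Sigma|^{|X|} \exp\left\{\frac{-L r_0^2}{\frac{|\Sigma|^{2|X|}}{2} + \frac{2 r_0 |\Sigma|^{|X|}}{3}}\right\}.$$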



Proof. There are $|\Sigma|^{|X|}$ distinct characters, with probabilities $p_i := \Pr\{C_i \mid T, M(\mu)\}$,
$i = 1$ to $|\Sigma|^{|X|}$. If $\mathbf{f}(A) \notin \rho(\mathbf{p}(T, \mu), r_0)$, then $|f_i - p_i| \geq r_0/|\Sigma|^{|X|}$ for some $i$.
Therefore, we apply Bernstein's inequality to each distinct character to bound the probability
that $|f_i - p_i| \geq r_0/|\Sigma|^{|X|}$. We then apply the union bound to get the result. □

If we set

$$r_0 = \frac{\min\{d(\mathbf{p}(T_i, \mu), \mathbf{p}(T_j, \mu)) : T_i, T_j \in \mathcal{U},\ T_i \ncong T_j\}}{2}, \tag{5.1}$$

it will ensure that the X-forests are separated by open balls of radius $r_0$ in the space of
site pattern probability vectors, i.e., the open balls $\rho(\mathbf{p}(T_i, \mu), r_0)$ and $\rho(\mathbf{p}(T_j, \mu), r_0)$ are
non-intersecting whenever $T_i$ and $T_j$ are non-isomorphic. We will use this value of $r_0$
unless specified otherwise.
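
Given the site pattern probability vectors of a finite collection of X-forests, this radius is half the smallest pairwise 1-norm distance. The Python sketch below illustrates the computation; the input vectors in any usage are hypothetical placeholders rather than probabilities computed from actual X-forests.

```python
from itertools import combinations

def separation_radius(vectors):
    """Half the minimum pairwise 1-norm distance between site pattern vectors."""
    return min(sum(abs(a - b) for a, b in zip(u, v))
               for u, v in combinations(vectors, 2)) / 2
```

With this choice, two open balls of radius $r_0$ around distinct vectors cannot intersect, by the triangle inequality.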

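Now for an undirected X-forest $T_i$, we define

$$\mathcal{A}_i := \mathcal{A}(T_i, r_0, L) := \{A \in \Sigma^{XL} : \mathbf{f}(A) \in \rho(\mathbf{p}(T_i, \mu), r_0)\}, \tag{5.2}$$

and

$$\epsilon_i := \epsilon(T_i) := 1 - \Pr\{\mathcal{A}_i \mid T_i, M(\mu)\}. \tag{5.3}$$

By selecting a sufficiently large value of $L$ we can make $\epsilon_i$ arbitrarily small as in Lemma 5.5.
Moreover, if $T_i \ncong T_j$, then

$$\epsilon_{ij} := \Pr\{\mathcal{A}_i \mid T_j, M(\mu)\} \leq 1 - \Pr\{\mathcal{A}_j \mid T_j, M(\mu)\} = \epsilon_j; \tag{5.4}$$

hence $\Pr\{\mathcal{A}_i \mid T_j, M(\mu)\}$ can be made arbitrarily small as per Lemma 5.5. We set
$\epsilon_{\max} := \max_i \epsilon(T_i)$, which depends on $L$ and $r_0$.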

In Theorem 5.13 (particularly in the proof of inequality [52]) we require a concentration
inequality similar to the inequality in Lemma 5.5 for the situation in which $L$ sites of an
alignment have evolved on an X-forest $T$ and $cL$ sites (for a small $c \in (0, 1)$) have
evolved on another X-forest $T_j$. (We will keep the notation simple by assuming that $cL$
is an integer.) Therefore, we give the following variants of Lemmas 5.4 and 5.5.

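Lemma 5.6. Let $X, X_1, X_2, \ldots$ be i.i.d. Bernoulli random variables with $\Pr\{X = 1\} = p$.
Let $Y_1, Y_2, \ldots$ be Bernoulli random variables. Let $S_n := \sum_{i=1}^{n} X_i + \sum_{i=1}^{cn} Y_i$, where $c$ is
a positive constant. Let $r > 0$ and $m := \max\{0, p - r\}$ and $M := \min\{1, p + r\}$. If
$r' := p - m(1 + c) > 0$ and $r'' := M(1 + c) - (p + c) > 0$, then

$$\Pr\left\{\left|\frac{S_n}{n(1 + c)} - p\right| > r\right\} \leq \exp\left\{\frac{-n (r')^2}{2p(1 - p) + 2r'/3}\right\} + \exp\left\{\frac{-n (r'')^2}{2p(1 - p) + 2r''/3}\right\}.$$

Proof. Since $\sum_{i=1}^{n} X_i \leq S_n \leq cn + \sum_{i=1}^{n} X_i$, we have

$$\begin{aligned}
\left|\frac{S_n}{n(1 + c)} - p\right| > r
 &\iff \left(\frac{S_n}{n(1 + c)} - p < -r\right) \text{ or } \left(\frac{S_n}{n(1 + c)} - p > r\right) \\
 &\implies \left(\frac{\sum_{i=1}^{n} X_i}{n(1 + c)} - p < -r\right) \text{ or } \left(\frac{cn + \sum_{i=1}^{n} X_i}{n(1 + c)} - p > r\right) \\
 &\implies \left(\frac{\sum_{i=1}^{n} X_i}{n(1 + c)} - \frac{p}{1 + c} < m - \frac{p}{1 + c}\right) \text{ or } \left(\frac{cn + \sum_{i=1}^{n} X_i}{n(1 + c)} - \frac{p + c}{1 + c} > M - \frac{p + c}{1 + c}\right) \\
 &\iff \left(\frac{\sum_{i=1}^{n} X_i}{n} - p < -r'\right) \text{ or } \left(\frac{\sum_{i=1}^{n} X_i}{n} - p > r''\right).
\end{aligned}$$

Now we apply Bernstein's inequality (Lemma 5.4) to each term and obtain the desired
bound. □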

Let $A$ be an alignment of length $L(1 + c)$. Suppose that $L$ characters of $A$ evolved on
an undirected X-forest $T$ and the remaining characters evolved on undirected X-forests
$T_1, T_2, \ldots, T_{cL}$. The following lemma states that if $c$ is sufficiently small, then $\mathbf{f}(A)$ is
concentrated near $\mathbf{p}(T, \mu)$ for large $L$. Moreover, as in Lemma 5.5, if we require $\mathbf{f}(A)$
to be sufficiently near $\mathbf{p}(T, \mu)$ with probability at least $1 - \epsilon_{\max}$, then the length of the
alignment $L(1 + c)$ must be $\Omega(\log(1/\epsilon_{\max}))$. In the following lemma, we do not specify
the constants $c$, $c_i$, $r'_i$ and $r''_i$ precisely, but they can be chosen depending on $r_0$.

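Lemma 5.7. Let $r_0 \in \mathbb{R}_+$ and $\mu \in (0, 1/|\Sigma|)$. Let $A \in \Sigma^{XL(1+c)}$ for a suitably chosen
positive constant $c$. Let $T, T_1, T_2, \ldots, T_{cL}$ be undirected X-forests. Then

$$\Pr\{\mathbf{f}(A) \notin \rho(\mathbf{p}(T, \mu), r_0) \mid T^L, T_1, T_2, \ldots, T_{cL}, M(\mu)\}
 \leq \sum_{i=1}^{|\Sigma|^{|X|}} \left( \exp\left\{\frac{-L (r'_i)^2}{2p_i(1 - p_i) + 2r'_i/3}\right\} + \exp\left\{\frac{-L (r''_i)^2}{2p_i(1 - p_i) + 2r''_i/3}\right\} \right),$$

where $r'_i$ and $r''_i$ are positive constants as in Lemma 5.6.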

Proof. We apply Lemma 5.6 for each component of $\mathbf{f}(A)$ and use the union bound as in
the proof of Lemma 5.5. For each component, we use $r := r_0/|\Sigma|^{|X|}$ as before. Constants
$r'_i$ and $r''_i$ (and $c_i$, which are implicit) depend on $r_0$ and the probabilities
$p_i := \Pr\{C_i \mid T, M(\mu)\}$, $i = 1$ to $|\Sigma|^{|X|}$. The constant $c$ may be taken to be the smallest
among $c_i$, $i = 1$ to $|\Sigma|^{|X|}$. □

5.2 Identifying HMMs: a sketch of the ideas used to prove Theorem 5.13



A hidden Markov model (HMM) is defined by two sequences $\{X_n\}_{n \geq 1}$ and $\{Y_n\}_{n \geq 1}$ of
random variables. The sequence $\{X_n\}_{n \geq 1}$ takes values in $[r]$ and is a stationary Markov
chain with transition matrix $A$ and initial distribution $\pi(i)$, $i = 1$ to $r$, which is also
the stationary distribution of the Markov chain. The random variables $\{Y_n\}_{n \geq 1}$ take
values in $[k]$, and are independent and identically distributed conditional on $\{X_n\}_{n \geq 1}$.
The distribution of $Y_n$ depends only on $X_n$. Let $B$ be the $r \times k$ matrix of conditional
probabilities $\Pr\{Y_n = j \mid X_n = i\}$, where $i \in [r]$ and $j \in [k]$. The sequence $\{Y_n\}_{n \geq 1}$ is
the sequence of observations. Identifiable hidden Markov models were characterised in [11],
where a precise description of conditions on $A$ and $B$ for which the probability distribution
on observed sequences determines $A$ and $B$ (up to re-labelling of hidden states) was given.
Here identifiability up to a re-labelling of hidden states means the following: if $S$ is an
$r \times r$ permutation matrix, then the HMM with parameters $(A, B, \pi)$ (where $\pi$ is treated
as a column vector of length $r$) is equivalent to (induces the same distribution on the
space of sequences of observed states as) the HMM with parameters $(S^{-1}AS, S^{-1}B, S^{-1}\pi)$.
Therefore, identifiability only means computing the matrices and the initial distribution
up to equivalence. We denote the class of models equivalent to $(A, B, \pi)$ by $\|(A, B, \pi)\|$.
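
Equivalence under re-labelling can be checked numerically with the forward algorithm. The Python sketch below is an illustration on a hypothetical two-state HMM: `relabel` renames the hidden states by a permutation (which corresponds to conjugating by a permutation matrix), and `sequence_probability` computes the probability of an observed sequence. The matrices and all names are assumptions made for this sketch.

```python
from itertools import product

def sequence_probability(A, B, pi, obs):
    """Forward algorithm: probability of an observed sequence under (A, B, pi)."""
    r = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(r)]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(r)) * B[j][y]
                 for j in range(r)]
    return sum(alpha)

def relabel(A, B, pi, perm):
    """Rename hidden states: new state a is old state perm[a]."""
    r = len(pi)
    A2 = [[A[perm[a]][perm[b]] for b in range(r)] for a in range(r)]
    B2 = [B[perm[a]] for a in range(r)]
    pi2 = [pi[perm[a]] for a in range(r)]
    return A2, B2, pi2

A = [[0.9, 0.1], [0.1, 0.9]]      # hypothetical transition matrix
B = [[0.8, 0.2], [0.3, 0.7]]      # hypothetical emission matrix
pi = [0.5, 0.5]                   # stationary for the symmetric matrix A
A2, B2, pi2 = relabel(A, B, pi, [1, 0])
```

Both parameter sets assign the same probability to every observed sequence, so no amount of data can tell them apart.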

Earlier we noted that Models R and RM for sequences evolving on a pedigree $P(X)$
define a hidden Markov model with the spanning forests in $P$ as hidden states and
characters from $\Sigma^X$ as observed states. We call it HMM$(P, p, \mu)$ and denote its matrices by
$A(P, p)$ and $B(P, \mu)$. The initial distribution on hidden states is uniform: each spanning
forest has probability $1/2^e$ if the pedigree has $2e$ arcs. We informally look at some of
the issues about its identifiability.

The transition matrix $A(P, p)$ is defined by the transition probabilities given in
Equation (3.1). Therefore, for a fixed $p$, $A(P, p)$ is identical (up to a permutation of rows
and columns) for all pedigrees with the same number of arcs. But the set of spanning
forests (hidden states) is unknown.

We now describe at a high level how we compute the rows of $B(P, \mu)$. Suppose the
pedigree contains an undirected X-forest $T_i$. There are $n(G > T_i : P)$ spanning forests $G$
of $P$ that contain $T_i$ as the unique undirected X-forest, and corresponding to each of them
we have a row of $B(P, \mu)$ that is equal to $\mathbf{p}(T_i, \mu)$. Now consider a set
$\mathcal{A}_i := \mathcal{A}(T_i, r_0, L)$ of sufficiently long alignments as defined in Equation (5.2).
We compute the probability of $\mathcal{A}_i$ (i.e., the probability that a random alignment is in
$\mathcal{A}_i$). Suppose that $P_0$ is the probability that there are no recombination events. Thus
$P_0$ approaches 1 as $p$ approaches 0. Then one of the terms in the expression for the
probability of $\mathcal{A}_i$ will be $n(G > T_i : P) P_0 (1 - \epsilon_i)/2^e$, where $n(G > T_i : P)/2^e$ is the
probability that the first site evolved on a spanning forest $G$ that contained $T_i$ as the
undirected X-forest. There will be terms for contributions from other undirected X-forests
$T_j \not\cong T_i$, but they will be much smaller than the above term because they will contain
factors $\epsilon_j$ (as in Equation (5.4)). There will also be terms to account for recombinations
among the first $L$ sites, but they will be small as well since they will contain $p$ as a factor
(in contrast to $P_0$, which is a power of $(1 - p)$). So let us say that
$\Pr\{\mathcal{A}_i \mid P, RM(p, \mu)\}$ is $n(G > T_i : P) P_0 (1 - \epsilon_i)/2^e$ + (terms of smaller order).
Therefore if $p$ is sufficiently small and $L$ is sufficiently large, then
$\Pr\{\mathcal{A}_i \mid P, RM(p, \mu)\}$ will be roughly equal to the dominating term
$n(G > T_i : P) P_0 (1 - \epsilon_i)/2^e$, which will uniquely determine $n(G > T_i : P)$. In other words,
if $Q$ is another pedigree such that $n(G > T_i : P) \neq n(G > T_i : Q)$, then
$\Pr\{\mathcal{A}_i \mid P, RM(p, \mu)\}$ and $\Pr\{\mathcal{A}_i \mid Q, RM(p, \mu)\}$ will differ roughly by a
multiple of $P_0 (1 - \epsilon_i)/2^e$.

In the proof of Proposition 4.6 we used a similar idea: the pedigree $Q_2$ contains a
certain subtree $T$ in which $i, j, k$ have a common ancestor, while the pedigree $P_2$ does not
contain such a subtree. As a result, the alignments that are close to $\mathbf{p}(T, \mu)$ are more
likely to have evolved on $Q_2$ than on $P_2$.

Suppose now that $B(P, \mu)$ is identified and each of its rows is labelled by the
corresponding unlabelled undirected X-forest. That is, the matrices $B(P, \mu)$ that appear
among the triples in the equivalence class $\|(A, B, \pi)\|$ of HMMs are constructed. As
pointed out above, the matrix $A(P, p)$ and the initial distribution are also known up to
relabelling of hidden states. But the equivalence class $\|(A, B, \pi)\|$ is not known unless we
are able to label the rows and the columns of $A(P, p)$ by unlabelled undirected X-forests
in a manner consistent with the labelling of rows of $B(P, \mu)$. In other words, for full
identifiability of HMM$(P, p, \mu)$, we would like to construct an automaton with transition
probabilities given by $A(P, p)$ and with its states labelled by unlabelled undirected
X-forests. Identifying the pedigree from the labelled automaton will then be a purely
combinatorial problem.

In this paper we do not succeed in constructing the matrix $A(P, p)$ with rows and columns
labelled by undirected X-forests, but we are able to count certain types of walks (to be
described next) on the automaton with vertices labelled by undirected X-forests. Suppose
that $T_1, T_2, \ldots, T_m$ is a sequence of undirected X-forests such that no two consecutive ones
are isomorphic. Analogous to $n(G > T : P)$, we define $n(\mathbf{G} > \mathbf{T} : P)$ as the number of
sequences $G_1, G_2, \ldots, G_m$ of spanning forests in $P$ such that $G_i > T_i$, where consecutive
$G_i$ are separated by just one recombination. (A single recombination at a site is more
likely than multiple recombinations. Moreover, if there is a recombination at a site $i$,
but the two spanning forests $G_i$ and $G_{i+1}$ contain isomorphic undirected X-forests, then
such a recombination has no effect on the emitted characters. These are the reasons why
we consider the sequences $T_i$ and $G_i$ as above.) We then consider a set $\mathcal{A}$ of alignments
of length $mL$ (for a suitably large $L$) obtained by concatenating alignments from $\mathcal{A}_i$ for
$i = 1$ to $m$. We compute the probability of $\mathcal{A}$ (as we described for $n(G > T)$ above), and
show that the dominating term is proportional to $n(\mathbf{G} > \mathbf{T} : P)$, and other terms are of
smaller order of magnitude for small $p$. This allows us to compute $n(\mathbf{G} > \mathbf{T} : P)$.

A more visual description of the walks may be given as follows. Suppose the pedigree
has $2e$ arcs. We define a graph on the vertex set consisting of the spanning forests of
the pedigree, with two spanning forests $G_i$ and $G_j$ being adjacent if there is exactly one
recombination separating them (i.e., $|E(G_i) \triangle E(G_j)| = 2$). The graph is a hypercube.
The hidden Markov chain on the set of spanning forests jumps on the vertices of the
cube. If there is at most one recombination at any site (which is more likely than more
than one recombination at a site), then we have a walk on the edges of the cube. We
label each vertex $G_i$ of the cube by the undirected X-forest $T_u(G_i)$. Our interest is to
construct this object for a more complete understanding of the HMM. But the problem is
made difficult by the fact that the emission probabilities associated with $G_i$ and $G_j$ are
identical if $T_u(G_i) \cong T_u(G_j)$. Therefore, we construct a weaker object, namely the number
of walks of each length on the cube such that consecutive vertices have distinct labels.
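
The weaker object described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: a spanning forest is encoded as a tuple of $e$ bits (one bit per non-founder vertex, recording which of its two incoming arcs is used), and `label` is a hypothetical stand-in for the map $G \mapsto T_u(G)$, which for a real pedigree would have to be computed from the graph.

```python
from itertools import product

def count_label_changing_walks(e, label, num_vertices):
    """Count walks G_1, ..., G_k (k = num_vertices) on the e-dimensional cube
    in which consecutive vertices differ in exactly one bit (one recombination)
    and carry different labels."""
    vertices = list(product((0, 1), repeat=e))

    def neighbours(g):
        for i in range(e):
            yield g[:i] + (1 - g[i],) + g[i + 1:]

    # walks[g] = number of admissible walks of the current length ending at g
    walks = {g: 1 for g in vertices}
    for _ in range(num_vertices - 1):
        walks = {
            h: sum(walks[g] for g in neighbours(h) if label(g) != label(h))
            for h in vertices
        }
    return sum(walks.values())

# Toy labelling (an assumption for illustration): the label is the first bit,
# so only a recombination at the first vertex changes the label.
first_bit = lambda g: g[0]
print(count_label_changing_walks(3, first_bit, 2))  # 8
print(count_label_changing_walks(3, first_bit, 3))  # 8
```

With a labelling that distinguishes all vertices the function counts all walks on the cube, and with a constant labelling it returns 0 for walks on two or more vertices, which reflects why vertices with isomorphic labels cannot be told apart from the emitted characters.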



5.3 The main results 

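Definition 5.8. Let $P$ be a pedigree. For $\mathbf{T} := (T_1, T_2, \ldots, T_m) \in \mathcal{U}_P^m$, we define

$$
\begin{aligned}
n(\mathbf{G} > \mathbf{T} : P)
&:= n(G_1 > T_1, G_2 > T_2, \ldots, G_m > T_m : P) \\
&:= \left|\{\mathbf{G} \in \mathcal{G}_P^m : T_u(G_i) \cong T_i \ \forall i \in [m] \ \wedge\ |E(G_{i+1}) \triangle E(G_i)| = 2 \ \forall i \in [m-1]\}\right|,
\end{aligned}
$$

where the second condition in the last line says that there is exactly one recombination
event between consecutive $G_i$.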

In the rest of this section, we show how the invariants $n(\mathbf{G} > \mathbf{T} : P)$ may be computed
from the probability distribution on the space of alignments under Model RM. In the end,
we demonstrate an application to pedigrees $P_1$ and $Q_1$ shown in Figure El

Lemma 5.9. Let $P$ be a pedigree with $e(P) = 2e$ arcs. Let $\mathbf{G} := (G_1, G_2, \ldots, G_m)$ be a
sequence of spanning forests in $P$. Then under model $R(p)$, the probability that $\mathbf{G}$ is a
sequence of site-specific spanning forests is given by

$$\Pr\{\mathbf{G} \mid P, R(p)\} = \frac{(1 - p)^{s(\mathbf{G})} p^{r(\mathbf{G})}}{2^e},$$

where $r(\mathbf{G})$ and $s(\mathbf{G})$ are as in Definition 2.8.

Proof. We have a factor $p$ for each recombination event and $(1 - p)$ whenever there is
no recombination (i.e., an arc is contained in two consecutive spanning forests in the
sequence). The probability that the first spanning forest is $G_1$ is $1/2^e$. □
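
The formula of Lemma 5.9 can be evaluated directly. The sketch below is a minimal illustration and rests on assumptions: each spanning forest is encoded as a tuple of $e$ bits (one per non-founder vertex), $r(\mathbf{G})$ is taken to be the number of arc choices that change between consecutive forests, and $s(\mathbf{G})$ the number that stay the same, as described in the proof above. The check at the end confirms that the probabilities of all sequences of a given length sum to 1.

```python
from itertools import product

def sequence_probability(forests, p):
    """Pr{G | P, R(p)} = (1 - p)^s(G) * p^r(G) / 2^e for a sequence of forests.

    Each forest is a tuple of e bits; bit v records which of the two incoming
    arcs of the v-th non-founder vertex is used (an encoding assumed here).
    """
    e = len(forests[0])
    # r counts recombination events, s counts arcs kept between consecutive sites
    r = sum(a != b for g, h in zip(forests, forests[1:]) for a, b in zip(g, h))
    s = e * (len(forests) - 1) - r
    return (1 - p) ** s * p ** r / 2 ** e

e, m, p = 2, 3, 0.1
all_forests = list(product((0, 1), repeat=e))
total = sum(sequence_probability(seq, p) for seq in product(all_forests, repeat=m))
print(total)  # 1 up to floating-point rounding
```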



Notation 

Let $A \in \Sigma^{XL}$ be an alignment of length $L$. For an interval $[l_1, l_2] \subseteq [L]$, we write
$A[l_1, l_2]$ for the part of the alignment between columns $l_1$ and $l_2$ (inclusive of columns
$l_1$ and $l_2$). For a sequence of alignments $A_i \in \Sigma^{X l_i}$, $i \in [m]$, let
$A := A_1 : A_2 : \ldots : A_m$ denote the alignment obtained by concatenating alignments
$A_i$, $i = 1, 2, \ldots, m$ (in that order). Let $\mathcal{A}_i \subseteq \Sigma^{X l_i}$, $i \in [m]$, be sets of
alignments. We define

$$\mathcal{A} := \mathcal{A}_1 : \mathcal{A}_2 : \ldots : \mathcal{A}_m := \{A_1 : A_2 : \ldots : A_m \mid A_i \in \mathcal{A}_i, i \in [m]\}.$$

Lemma 5.10. Let $P$ be a pedigree with $e(P) = 2e$ arcs. The probability of an alignment
$A \in \Sigma^{XL}$ on $P$ is given by

$$
\Pr\{A \mid P, RM(p, \mu)\}
= \sum_{k=1}^{L} \sum_{\mathbf{G} \in \mathcal{G}_P^k} \frac{(1 - p)^{e(L-k)+s(\mathbf{G})} p^{r(\mathbf{G})}}{2^e}
\sum_{\mathbf{l} \in \mathbb{Z}_+^k :\, \sum_i l_i = L} \ \prod_{i=1}^{k} \Pr\{A[L_{i-1}+1, L_i] \mid T_u(G_i), M(\mu)\},
\qquad (5.5)
$$

where $L_0 := 0$, $L_i := L_{i-1} + l_i$ and $\mathbf{l} := (l_1, l_2, \ldots, l_k) \in \mathbb{Z}_+^k$; the second summation is
over $\mathbf{G}$ such that consecutive spanning forests $G_i$ and $G_{i+1}$ are unequal for $i \in [k-1]$.

Proof. The probability of an alignment of length $L$ is obtained by summing its probability
over all spanning subgraph sequences of length $L$. Suppose that the recombinations in a
spanning subgraph sequence occur only at sites $L_i := L_{i-1} + l_i$, for $i = 1, 2, \ldots, k-1$, where
$L_0 = 0$. We write the spanning forest sequence of length $L$ as $(G_i^{l_i}, i = 1, 2, \ldots, k)$. Then
the probability of the alignment is written as a product of probabilities of its segments
that have evolved on spanning forests $G_i$ (i.e., effectively on $T_u(G_i)$). This probability is
summed over $\mathbf{l} \in \mathbb{Z}_+^k$ (with the constraint $\sum_i l_i = L$), $k \in [L]$ and $\mathbf{G} \in \mathcal{G}_P^k$. For a fixed
$\mathbf{G} \in \mathcal{G}_P^k$, the probability of $(G_i^{l_i}, i = 1, 2, \ldots, k)$ is given by Lemma 5.9. □

For $\mathcal{A} \subseteq \Sigma^{XL}$, we will compute $\Pr\{\mathcal{A} \mid P, RM(p, \mu)\}$ by summing (5.5) over $A \in \mathcal{A}$.
In such a calculation, we will sometimes use the following upper bound in evaluating the
last summation in (5.5) for fixed values of $k$, and fixed $\mathbf{l}$ and $\mathbf{G}$.

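Lemma 5.11. Let $\mathcal{A} \subseteq \Sigma^{XL}$. Let $0 = L_0 < L_1 < \ldots < L_k = L$. Let
$\mathcal{A}_i := \{A[L_{i-1}+1, L_i] : A \in \mathcal{A}\}$, $i \in [k]$. Then for any fixed $\mathbf{G} \in \mathcal{G}_P^L$,

$$\Pr\{\mathcal{A} \mid \mathbf{G}, M(\mu)\} \le \Pr\{\mathcal{A}_1 : \mathcal{A}_2 : \ldots : \mathcal{A}_k \mid \mathbf{G}, M(\mu)\}.$$

We have equality if $\mathcal{A} = \mathcal{A}_1 : \mathcal{A}_2 : \ldots : \mathcal{A}_k$.

Proof. The claim follows from the observation that $\mathcal{A} \subseteq \mathcal{A}_1 : \mathcal{A}_2 : \ldots : \mathcal{A}_k$. □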

The following lemma is used in the proof of Equation (5.8), and the reader may skip
it until then.

Lemma 5.12. Let $Q$ be a finite set. Let $\mathbf{T}$ and $\mathbf{S}$ be two sequences in $Q$, each of length
$l$, defined by

$$\mathbf{T} := T_1^{\alpha_1} T_2^{\alpha_2} \cdots T_m^{\alpha_m} := \underbrace{T_1, \ldots, T_1}_{\alpha_1}, \underbrace{T_2, \ldots, T_2}_{\alpha_2}, \ldots, \underbrace{T_m, \ldots, T_m}_{\alpha_m},$$

where $\sum_{i=1}^{m} \alpha_i = l$ and $\alpha_i > 0 \ \forall i \in [m]$, and

$$\mathbf{S} := S_1^{\beta_1} S_2^{\beta_2} \cdots S_n^{\beta_n} := \underbrace{S_1, \ldots, S_1}_{\beta_1}, \underbrace{S_2, \ldots, S_2}_{\beta_2}, \ldots, \underbrace{S_n, \ldots, S_n}_{\beta_n},$$

where $\sum_{i=1}^{n} \beta_i = l$ and $\beta_i > 0 \ \forall i \in [n]$. Suppose that $\mathbf{T}$ and $\mathbf{S}$ satisfy the constraints:
$T_i \neq T_{i+1} \ \forall i \in [m-1]$; $S_i \neq S_{i+1} \ \forall i \in [n-1]$; $n \le m$; if $n = m$, then $S_i \neq T_i$ for some
$i$. Then there is at least one block $T_i^{\alpha_i}$ of $\mathbf{T}$ over which $\mathbf{T}$ and $\mathbf{S}$ mismatch everywhere;
hence there are at least $\min\{\alpha_i : i \in [m]\}$ mismatches between the two sequences.

Proof. Suppose that the claim is false; so in each block $T_i^{\alpha_i}$ of $\mathbf{T}$, there is a matching
symbol in $\mathbf{S}$. Therefore, $T_1, T_2, \ldots, T_m$ is a subsequence of $\mathbf{S}$ and $n \ge m$. This, together
with $T_i \neq T_{i+1} \ \forall i \in [m-1]$ and $n \le m$, implies that $n = m$ and $S_i = T_i \ \forall i \in [m]$, which
contradicts the assumption that when $n = m$, there is some $i$ for which $S_i \neq T_i$. □
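
Lemma 5.12 is easy to test exhaustively on short sequences. The following Python sketch is only a sanity check under assumptions (a three-letter alphabet and sequences of length 5; the function names are introduced here for illustration): it enumerates all pairs of sequences that satisfy the hypotheses and verifies that some block of $\mathbf{T}$ mismatches $\mathbf{S}$ at every position.

```python
from itertools import groupby, product

def runs(seq):
    """Run-length encoding: [(symbol, run length), ...]."""
    return [(sym, len(list(grp))) for sym, grp in groupby(seq)]

def has_fully_mismatched_block(T, S):
    """True if some block (maximal run) of T disagrees with S at every position."""
    start = 0
    for sym, length in runs(T):
        if all(S[j] != sym for j in range(start, start + length)):
            return True
        start += length
    return False

def lemma_holds(length, alphabet):
    """Exhaustively test the claim of Lemma 5.12 for all pairs of sequences."""
    sequences = list(product(alphabet, repeat=length))
    for T in sequences:
        t_syms = [sym for sym, _ in runs(T)]
        for S in sequences:
            s_syms = [sym for sym, _ in runs(S)]
            if len(s_syms) > len(t_syms) or s_syms == t_syms:
                continue  # hypotheses of the lemma are not satisfied
            if not has_fully_mismatched_block(T, S):
                return False
    return True

print(lemma_holds(5, "abc"))  # True
```

Because the blocks are taken to be maximal runs, the constraints $T_i \neq T_{i+1}$ and $S_i \neq S_{i+1}$ hold automatically; only the conditions on $n$ and $m$ need to be filtered.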

Theorem 5.13. Let $P$ and $Q$ be any two pedigrees with $e(P) = e(Q) = 2e$ arcs. Let
$\mathbf{T} := (T_i, i = 1, 2, \ldots, m)$ be any sequence of undirected X-forests in which consecutive
X-forests are non-isomorphic. Then for all $\mu \in (0, 1/|\Sigma|)$, there exists
$p_0 := p_0(e, m, \mu) \in (0, 1)$ such that for all $p \in (0, p_0)$, the following statement is true: if
$(\Sigma^{XL} : P, RM(p, \mu)) = (\Sigma^{XL} : Q, RM(p, \mu)) \ \forall L \in \mathbb{N}$, then
$n(\mathbf{G} > \mathbf{T} : P) = n(\mathbf{G} > \mathbf{T} : Q)$.

Proof. Let $\mathcal{A} := \mathcal{A}_1^m : \mathcal{A}_2^m : \cdots : \mathcal{A}_m^m$, where $\mathcal{A}_i := \mathcal{A}(T_i, r_0, L)$ and $r_0$ are as
defined in Equations (5.1) and (5.2). We will choose $\epsilon_{\max}$ (defined at the end of
Section 5.1) and $L$ (that depends on $\epsilon_{\max}$) suitably later. The probability of $\mathcal{A}$ on $P$ and $Q$
is written by summing (5.5) over all $A \in \mathcal{A}$. But based on Theorem 5.1, Corollary 5.3 and
Lemma 5.7, we can make the following qualitative and somewhat informal statement: if $L$
is large enough, then $\Pr\{\mathcal{A} \mid P, RM(p, \mu)\}$ will get a significantly higher contribution from
spanning forest sequences $\mathbf{G} := (G_i^{l_i}, i = 1, 2, \ldots, m)$ of length $m^2L$ such that $T_u(G_i) \cong T_i$
for all $i \in [m]$ and $l_i$ are all close to $mL$, than from spanning forest sequences
$\mathbf{G} := (G_j, j = 1, 2, \ldots, m^2L)$ for which there are many mismatches (in terms of isomorphism)
between the sequences $(T_i^{mL}, i = 1, 2, \ldots, m)$ and $(T_u(G_j), j = 1, 2, \ldots, m^2L)$, or which
require more than $m - 1$ recombinations.

Let $\Pr\{\mathcal{A} \mid P, RM(p, \mu)\} = \sum_{k \in \mathbb{N}} P_k(P)$, where $P_k(P)$ is the joint probability of $\mathcal{A}$ and
the event that there are exactly $k$ recombinations. Furthermore, we write
$P_{(m-1)}(P) = P_{(m-1)a}(P) + P_{(m-1)b}(P)$, where $P_{(m-1)a}(P)$ is the contribution from spanning
forest sequences $\mathbf{G} := (G_i^{l_i}, i = 1, 2, \ldots, m)$ of length $m^2L$ such that $T_u(G_i) \cong T_i$ for all
$i \in [m]$, and $P_{(m-1)b}(P)$ is the remaining contribution to $P_{m-1}$, i.e., from spanning forest
sequences $\mathbf{G} := (G_i^{l_i}, i = 1, 2, \ldots, l+1)$ of length $m^2L$ such that either $l < m - 1$ (i.e., the
$m - 1$ recombinations occur at fewer than $m - 1$ sites), or $l = m - 1$ and $T_u(G_i) \not\cong T_i$ for
some $i \in [m]$.

We will show that only $P_{(m-1)a}(P)$ makes a significant contribution to
$\Pr\{\mathcal{A} \mid P, RM(p, \mu)\}$. In particular, we will show that the various contributions to
$\Pr\{\mathcal{A} \mid P, RM(p, \mu)\}$ take the following form:

$$
\begin{aligned}
P_{(m-1)a}(P) &= n(\mathbf{G} > \mathbf{T} : P)\, A(\mathbf{T}), \\
\sum_{k < m-1} P_k(P) + P_{(m-1)b}(P) &\le \delta_1, \\
\sum_{k \ge m} P_k(P) &\le \delta_2,
\end{aligned}
$$

where $A(\mathbf{T})$ does not depend on the pedigree (but depends on the undirected X-forest
sequence $\mathbf{T}$). Moreover, $A(\mathbf{T})$, $\delta_1$ and $\delta_2$ depend on $e$, $m$, $L$, $p$ and $\epsilon_{\max}$. We will show that
when $p$ and $\epsilon_{\max}$ are sufficiently small and $L$ is sufficiently large, $\delta_1$ and $\delta_2$ are very small
compared to $A(\mathbf{T})$. It will imply that for $\Pr\{\mathcal{A} \mid P, RM(p, \mu)\}$ and $\Pr\{\mathcal{A} \mid Q, RM(p, \mu)\}$
to be equal, $n(\mathbf{G} > \mathbf{T} : P)$ and $n(\mathbf{G} > \mathbf{T} : Q)$ must be equal. (Otherwise, there would be
a difference between $\Pr\{\mathcal{A} \mid P, RM(p, \mu)\}$ and $\Pr\{\mathcal{A} \mid Q, RM(p, \mu)\}$ that is of the order
of a multiple of $A(\mathbf{T})$.)



A lower bound on $P_{(m-1)a}(P)$.

Since the consecutive X-forests in $\mathbf{T}$ are nonisomorphic and $T_u(G_i) \cong T_i$ for all $i \in [m]$,
there is at least one recombination between consecutive spanning forests $G_i$. And since
there are exactly $m - 1$ recombinations, there must be exactly one recombination between
consecutive spanning forests.

Therefore,

$$
\begin{aligned}
P_{(m-1)a}(P)
&= n(\mathbf{G} > \mathbf{T} : P) \left( \frac{(1 - p)^{e(m^2L - m) + (m-1)(e-1)} p^{m-1}}{2^e}
\sum_{A \in \mathcal{A}} \ \sum_{\mathbf{l} \in \mathbb{Z}_+^m :\, \sum_k l_k = m^2L} \ \prod_{i=1}^{m} \Pr\{A[L_{i-1}+1, L_i] \mid T_u(G_i), M(\mu)\} \right) \\
&= n(\mathbf{G} > \mathbf{T} : P)\, A(\mathbf{T}),
\end{aligned}
\qquad (5.6)
$$

where $L_0 := 0$, $L_i := L_{i-1} + l_i$. The inner summation is evaluated for any fixed choice
of $\mathbf{G} := (G_i^{l_i}, i = 1, 2, \ldots, m)$ such that $T_u(G_i) \cong T_i$ for all $i \in [m]$, because for any
fixed $l_i$, $i \in [m]$, and spanning forest sequences $\mathbf{G} := (G_i^{l_i}, i = 1, 2, \ldots, m)$ and
$\mathbf{G}' := ((G'_i)^{l_i}, i = 1, 2, \ldots, m)$, if $T_u(G_i) \cong T_u(G'_i)$ for all $i \in [m]$, then for each
alignment $A$, we have
$\Pr\{A[L_{i-1}+1, L_i] \mid G_i^{l_i}, M(\mu)\} = \Pr\{A[L_{i-1}+1, L_i] \mid (G'_i)^{l_i}, M(\mu)\} = \Pr\{A[L_{i-1}+1, L_i] \mid T_u(G_i), M(\mu)\}$.
Therefore, the RHS is a product of $n(\mathbf{G} > \mathbf{T} : P)$ and a factor
that does not explicitly depend on the pedigree, but only on $\mathbf{T}$.
We have

$$A(\mathbf{T}) \ge \frac{(2cL)^{m-1} (1 - p)^{em^2L - m - e + 1} p^{m-1} (1 - \epsilon_{\max})^{m^2}}{2^e}. \qquad (5.7)$$

To prove the lower bound, we sum (5.5) over spanning forest sequences of the type
$\mathbf{G} := (G_i^{l_i}, i = 1, 2, \ldots, m)$, where $T_u(G_i) \cong T_i$ for all $i \in [m]$ and $l_i$ are such that the
$m - 1$ recombinations occur at sites $L_i \in [imL - cL, imL + cL)$, $i = 1$ to $m - 1$, for a small
positive constant $c$ (i.e., we ignore contributions from the spanning forest sequences in
which some of the recombinations are not near the boundaries of the blocks $\mathcal{A}_i^m$, $i \in [m]$,
of $\mathcal{A}$). We can choose the recombination sites in $(2cL)^{m-1}$ ways. Since
$s(\mathbf{G}) = em^2L - e - m + 1$ and $r(\mathbf{G}) = m - 1$, the probability of $\mathbf{G}$ is
$(1 - p)^{em^2L - m - e + 1} p^{m-1} / 2^e$. We write $\mathbf{G} := (\mathbf{G}_i, i = 1, 2, \ldots, m)$, where $\mathbf{G}_i$ are blocks
of $\mathbf{G}$ of length $mL$ each. Then we have

$$\Pr\{\mathcal{A}_i^m \mid \mathbf{G}_i, M(\mu)\} \ge (1 - \epsilon_{\max})^m$$

for each $\mathbf{G}$ and for all $i$, provided we choose $c$, $L$ and $\epsilon_{\max}$ appropriately according to
Lemma 5.7.

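An upper bound on $\sum_{k < m-1} P_k(P) + P_{(m-1)b}(P)$.

We have

$$
\sum_{k < m-1} P_k(P) + P_{(m-1)b}(P)
\le \sum_{k=0}^{m-1} \left( \sum_{l=0}^{k} n_{kl} \binom{m^2L - 1}{l} \frac{(1 - p)^{e(m^2L - 1) - k} p^k}{2^e} (\epsilon_{\max})^{m-l} \right)
\le c_2 L^{m-1} \epsilon_{\max} =: \delta_1, \qquad (5.8)
$$

where $n_{kl}$ is the number of spanning forest sequences $(G_1, G_2, \ldots, G_{l+1})$ in which $k$
recombinations occur at $l$ sites, and $c_2 > 0$ is a constant that depends on $e$ and $m$. We
explain below how the bound is obtained.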

We fix $\mathbf{G} \in \mathcal{G}_P^{m^2L}$ such that $r(\mathbf{G}) = k < m - 1$. For $k = m - 1$, since we are
only interested in the contribution $P_{(m-1)b}$, we fix $\mathbf{G}$ with the additional restrictions in
the definition of $P_{(m-1)b}$. Suppose that the $k$ recombinations occur at $l$ distinct sites
$L_1, L_2, \ldots, L_l$, where $0 = L_0 < L_1 < L_2 < \ldots < L_l < L_{l+1} = m^2L$. So we write
$\mathbf{G} := (G_1^{L_1}, G_2^{L_2 - L_1}, \ldots, G_{l+1}^{m^2L - L_l})$. There are $n_{kl}$ choices for $(G_1, G_2, \ldots, G_{l+1})$.
For each choice of $(G_1, G_2, \ldots, G_{l+1})$, there are $\binom{m^2L - 1}{l}$ choices for the recombining
sites $L_1, \ldots, L_l$.

Each $\mathbf{G}$ has a probability $(1 - p)^{s(\mathbf{G})} p^{r(\mathbf{G})} / 2^e$, where $r(\mathbf{G}) = k$ and
$s(\mathbf{G}) = e(m^2L - 1) - k$. We show that $\Pr\{\mathcal{A} \mid \mathbf{G}, M(\mu)\}$ is bounded above by
$(\epsilon_{\max})^{m-l}$ for each $\mathbf{G}$ with $l$ recombining sites that satisfies the above constraints.

Let $\mathbf{G} := (G_1, G_2, \ldots, G_{m^2L})$ be a spanning forest sequence of length $m^2L$. For
$i \in [m]$, $j \in [m]$, we refer to the subsequences $\mathbf{G}_i := (G_k, k \in [(i-1)mL + 1, imL])$ as
blocks of $\mathbf{G}$, and to the subsequences
$\mathbf{G}_{ij} := (G_k, k \in [((i-1)m + j - 1)L + 1, ((i-1)m + j)L])$ as subblocks of $\mathbf{G}$. We say that a
subblock $\mathbf{G}_{ij}$ is recombination-free if there are no recombinations between
any two sites of the subblock. (The subblock may have recombinations at its boundaries.)

By Lemma 5.12, there is at least one block, say the $i$-th block, over which the sequences
$(T_i^{mL}, i = 1, 2, \ldots, m)$ and $(T_u(G_1)^{L_1}, T_u(G_2)^{L_2 - L_1}, \ldots, T_u(G_{l+1})^{m^2L - L_l})$ mismatch
everywhere. Since there are $m$ subblocks in each block and $l \le m - 1$, there are at least
$m - l$ recombination-free subblocks in the $i$-th block of $\mathbf{G}$. Let these subblocks be denoted
by $\mathbf{G}_{ij_k} := (G_{ij_k})^L$, $k = 1, 2, \ldots$. We have $T_i \not\cong T_u(G_{ij_k})$ for each of them. Therefore,

$$\Pr\{\mathcal{A} \mid \mathbf{G}, M(\mu)\} \le \prod_k \Pr\{\mathcal{A}_i \mid T_u(G_{ij_k}), M(\mu)\} \le (\epsilon_{\max})^{m-l}.$$

An upper bound on $\sum_{k \ge m} P_k(P)$.

We use the following fact about binomially distributed random variables: if
$X \sim \mathrm{Bin}(n, p)$, then $\Pr\{X \ge k\} \le \binom{n}{k} p^k$. (This result is a consequence of the union
bound.) Since there are $e(m^2L - 1)$ points at which a recombination can possibly occur, we
have

$$\sum_{k \ge m} P_k(P) \le \binom{e(m^2L - 1)}{m} p^m \le c_3 (Lp)^m =: \delta_2, \qquad (5.9)$$

where $c_3 > 0$ is a constant that depends on $e$ and $m$.
We write similar bounds for $Q$.
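
The binomial tail bound used for (5.9) can be compared with the exact tail probability. The Python sketch below is a small numerical illustration; the parameter values are arbitrary choices made here and are not taken from the paper.

```python
from math import comb

def binomial_tail(n, p, k):
    """Exact Pr{X >= k} for X ~ Bin(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def union_bound(n, p, k):
    """Upper bound C(n, k) * p^k from the union bound over k-subsets of trials."""
    return comb(n, k) * p**k

n, p, k = 20, 0.05, 3
print(binomial_tail(n, p, k), union_bound(n, p, k))
```

The bound is crude when $np$ is not small, but in the proof it is applied with $p$ tending to 0, where the term $\binom{n}{k} p^k$ dominates the tail.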

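Now suppose that $\Pr\{\mathcal{A} \mid P, RM(p, \mu)\} = \Pr\{\mathcal{A} \mid Q, RM(p, \mu)\}$ but
$n(\mathbf{G} > \mathbf{T} : P) \neq n(\mathbf{G} > \mathbf{T} : Q)$. Therefore, Equations (5.6) and (5.7) imply that

$$
\begin{aligned}
|P_{(m-1)a}(P) - P_{(m-1)a}(Q)|
&= |n(\mathbf{G} > \mathbf{T} : P) - n(\mathbf{G} > \mathbf{T} : Q)|\, A(\mathbf{T}) \\
&\ge \frac{(2cL)^{m-1} (1 - p)^{em^2L - m - e + 1} p^{m-1} (1 - \epsilon_{\max})^{m^2}}{2^e}.
\end{aligned}
\qquad (5.10)
$$

But that is impossible if we can choose $\epsilon_{\max}$, $L$, and $p$ so that $\delta_1 + \delta_2 < A(\mathbf{T})$, for
which it suffices that

$$c_2 L^{m-1} \epsilon_{\max} + c_3 (Lp)^m < \frac{(2cLp)^{m-1} (1 - p)^{em^2L - m - e + 1} (1 - \epsilon_{\max})^{m^2}}{2^e}. \qquad (5.11)$$

In other words, the discrepancy $|P_{(m-1)a}(P) - P_{(m-1)a}(Q)|$ cannot be compensated for by
the remaining terms in $\Pr\{\mathcal{A} \mid P, RM(p, \mu)\}$ and $\Pr\{\mathcal{A} \mid Q, RM(p, \mu)\}$. Such a choice is
possible. For example, we first choose $\epsilon_{\max} \in (0, 1)$, let us say, $\epsilon_{\max} = 1/M$, where $M > 1$.
We then set $L := L(M)$ and $p := p(M)$ so that

$$
\begin{aligned}
\epsilon_{\max} / p^{m-1} &\to 0 \text{ as } M \to \infty, \\
Lp &\to 0 \text{ as } M \to \infty, \text{ and} \\
(1 - p)^{em^2L - m - e + 1} (1 - \epsilon_{\max})^{m^2} &\to 1 \text{ as } M \to \infty.
\end{aligned}
\qquad (5.12)
$$

The choice of $L$ must guarantee phylogenetic consistency as in Corollary 5.3. Moreover,
we must set $L$ as small as possible so as to get a better bound on $p$. By Lemma 5.7, $L$ is
required to grow only logarithmically in $1/\epsilon_{\max}$, so let $L := c \log M$ for a suitable choice
of $c > 0$. Now conditions (5.12) are satisfied for $p := (\log M)^{-(1+\epsilon)}$ for $\epsilon > 0$. Then (5.11)
is satisfied for sufficiently large $M$. □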
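
The behaviour of the three quantities in (5.12) under this choice of parameters can be illustrated numerically. The sketch below is only indicative and makes assumptions that are not in the proof: $L$ is treated as a real number rather than an integer, the constant $c$ is set to 1, the exponent $\epsilon$ is set to 1, and $e = 6$, $m = 3$ are arbitrary values. The function takes $\log M$ as its argument to avoid overflow for large $M$.

```python
from math import exp

def limit_quantities(log_M, e, m, c=1.0, eps=1.0):
    """Evaluate the three quantities in (5.12) for eps_max = 1/M,
    L = c*log(M) and p = (log M)^-(1+eps), given log(M)."""
    eps_max = exp(-log_M)
    L = c * log_M                      # treated as a real number in this sketch
    p = log_M ** -(1 + eps)
    first = eps_max / p ** (m - 1)     # should tend to 0
    second = L * p                     # should tend to 0
    third = (1 - p) ** (e * m * m * L - m - e + 1) * (1 - eps_max) ** (m * m)
    return first, second, third       # third should tend to 1

for log_M in (25.0, 100.0, 400.0):
    print(log_M, limit_quantities(log_M, e=6, m=3))
```

The first two quantities decrease and the third increases towards 1 as $M$ grows, although the convergence of the third is slow because $p$ decays only as a power of $\log M$.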

5.4 Applications 

In the following, we illustrate Theorem 5.13 with a general result on counting X-forests
and a couple of specific examples.

Corollary 5.14. If the conditions of Theorem 5.13 are satisfied, then $P$ and $Q$ have the
same number of undirected X-forests of each type.

Proof. We apply Theorem 5.13 for $m = 1$. Suppose that $P$ is a pedigree with $e(P) = 2e$
arcs. Let $T$ be an undirected X-forest. Suppose that we want to count the number of
copies $n(T : P)$ of $T$ in $P$. Let $T_i$, $i = 1$ to $n(T : P)$, be the distinct copies of $T$ in $P$.
For each $T_i$, let $T_{ij}$, $j = 1, 2, \ldots$ be the undirected X-forests in $P$ that contain $T_i$. (To be
precise, we have subgraphs $S_i$ and $S_{ij}$ in $P$ such that when the (directed) arcs of $S_i$ and
$S_{ij}$ are replaced by (undirected) edges, we get the undirected X-forests $T_i$ and $T_{ij}$. But
we do not make this distinction in the following.) Note that $T_{ij}$ are all distinct X-forests.
There are $e - e(T_i)$ non-founder vertices in $P$ at which we can choose one of the two
incoming arcs to construct a spanning forest $G$ that contains $T_i$. Therefore, for each $T_i$
there are $2^{e - e(T_i)} = 2^{e - e(T)}$ spanning forests that contain $T_i$. Therefore,

$$n(T : P)\, 2^{e - e(T)} = \sum_i \sum_j |\{G \in \mathcal{G}_P : T_u(G) = T_{ij}\}|.$$

Now by grouping terms on the RHS by isomorphism classes of $T_{ij}$, we obtain

$$n(T : P)\, 2^{e - e(T)} = \sum_{(T' \in \|\mathcal{U}_P\|) \wedge (T' > T)} n(G > T' : P).$$

Since $n(G > T' : P) = n(G > T' : Q)$ for all undirected X-forests $T'$, we also have
$n(T : P) = n(T : Q)$ for all undirected X-forests $T$. □

Corollary 5.15. The examples of non-reconstructible pedigrees given in [15] can be
distinguished from the probability distribution on the space of alignments under Model RM.

Proof. Pedigrees in these examples do not have the same number of X-forests of each type. 
For example, one pedigree in each pair contains a common ancestor of all extant vertices 
while the other does not. This may also be observed in the examples in Figured □ 

But knowing the number of X-forests of each type is not always enough to distinguish
pedigrees, as verified by pedigrees $P_1$ and $Q_1$ shown in Figure El. They both have the
same number of directed (therefore, also undirected) X-forests of each type: they have 4
directed X-forests in which 1 and 2 have a common ancestor. We denote their undirected
X-forest by $T_1$, which is a path of length 4 with end vertices 1 and 2. There are 12
directed X-forests in which 1 and 2 do not have a common ancestor. We denote their
undirected X-forest by $T_2$, which consists of two isolated vertices 1 and 2. Moreover,
$n(G > T_1 : P_1) = n(G > T_1 : Q_1) = 16$ and $n(G > T_2 : P_1) = n(G > T_2 : Q_1) = 48$. Also,
for $\mathbf{T} := (T_1, T_2)$ (and for $\mathbf{T} := (T_2, T_1)$), we check by direct counting that
$n(\mathbf{G} > \mathbf{T} : P_1) = n(\mathbf{G} > \mathbf{T} : Q_1) = 64$. But $P_1$ and $Q_1$ can nevertheless be distinguished as
shown below.

Corollary 5.16. Pedigrees $P_1$ and $Q_1$ in Figured are distinguished from the probability
distribution on the space of alignments under Model RM provided the crossover probability
is sufficiently small.
Proof. Let $T_1$ and $T_2$ be the X-forests as described above. We apply Theorem 5.13
for $m = 3$ with $\mathbf{T} := (T_1, T_2, T_1)$. We can verify that $n(\mathbf{G} > \mathbf{T} : P_1) = 112$ and
$n(\mathbf{G} > \mathbf{T} : Q_1) = 104$. Therefore, $P_1$ and $Q_1$ give different distributions on the space of
alignments. In the following, we describe how $n(\mathbf{G} > \mathbf{T} : P_1)$ and $n(\mathbf{G} > \mathbf{T} : Q_1)$ are
counted.

Let $T_a$ denote the directed X-forest in $P_1$ and $Q_1$ consisting of paths $a \cdots 1$ and $a \cdots 2$.
Similarly, we write $T_b$, $T_c$, $T_d$ for other directed X-forests in $P_1$ and $Q_1$, rooted at $b$, $c$ and $d$,
respectively. These are the 4 directed X-forests that have $T_1$ as their undirected X-forest.
All other directed X-forests have $T_2$ as their undirected X-forest.

Counting $n(\mathbf{G} > \mathbf{T} : P_1)$: Since $T_u(G_1) \cong T_u(G_3) \cong T_1$, we count 16 different
contributions to $n(\mathbf{G} > \mathbf{T} : P_1)$ depending on the choices for $T_d(G_1)$ and $T_d(G_3)$ in
$\{T_a, T_b, T_c, T_d\}$. Please refer to Figure 8.




Figure 8: Counting $n(\mathbf{G} > \mathbf{T} : P_1)$ when $T_d(G_1) = T_a$ and $T_d(G_3) = T_b$ (left), and when
$T_d(G_1) = T_a$ and $T_d(G_3) = T_c$ (right)

When $T_d(G_1) = T_a$ and $T_d(G_3) = T_a$: This is possible only if $G_3 = G_1$. There are 4
choices of $G_1$, and for each of them there are 4 choices for $G_2$ depending on which arc of
$T_a$ is replaced to obtain $G_2$. Thus we have 16 sequences $\mathbf{G}$ for which $T_d(G_1) = T_a$ and
$T_d(G_3) = T_a$.

When $T_d(G_1) = T_a$ and $T_d(G_3) = T_b$: the dashed arcs $bg$ and $be$ in Figure 8 on the
left must be obtained by replacing $ag$ and $ae$, and there are two sequences $\mathbf{G}$ that achieve
this: either $G_2 = G_1 - ae + be$ or $G_2 = G_1 - ag + bg$. Also, there are 4 possible ways
to include arcs pointing to $f$ and $h$ in $G_1$. Therefore, there are 8 sequences $\mathbf{G}$ such that
$T_d(G_1) = T_a$ and $T_d(G_3) = T_b$.

When $T_d(G_1) = T_a$ and $T_d(G_3) = T_c$: there are only 2 sequences $\mathbf{G}$ for which this is
possible, since the arcs $cf$ and $ch$ (shown in bold in Figure 8 on the right) must already
be in $G_1$.

When $T_d(G_1) = T_a$ and $T_d(G_3) = T_d$: this case is similar to the case $T_d(G_3) = T_c$.

Thus there are 28 choices of $\mathbf{G}$ such that $T_d(G_1) = T_a$, and similarly 28 choices each
for $T_d(G_1) = T_b$, $T_d(G_1) = T_c$, and $T_d(G_1) = T_d$. Therefore, there are 112 sequences $\mathbf{G}$
such that $\mathbf{G} > \mathbf{T}$.

Counting $n(\mathbf{G} > \mathbf{T} : Q_1)$: Again we count 16 different contributions to $n(\mathbf{G} > \mathbf{T} : Q_1)$
depending on the choices for $T_d(G_1)$ and $T_d(G_3)$ in $\{T_a, T_b, T_c, T_d\}$. Please refer to Figure 9.



Figure 9: Counting $n(\mathbf{G} > \mathbf{T} : Q_1)$ when $T_d(G_1) = T_a$ and $T_d(G_3) = T_b$ (left), and when
$T_d(G_1) = T_a$ and $T_d(G_3) = T_d$ (right)

When $T_d(G_1) = T_a$ and $T_d(G_3) = T_a$: as in the case of $P_1$, we have 16 sequences $\mathbf{G}$ for
which $T_d(G_1) = T_a$ and $T_d(G_3) = T_a$.

When $T_d(G_1) = T_a$ and $T_d(G_3) = T_b$: the dashed arcs $be$ and $h2$ in Figure 9 on the left
must be obtained by replacing $ae$ and $g2$, and arc $bh$ must already be in $G_1$. There are
two sequences $\mathbf{G}$ that achieve this: either $G_2 = G_1 - ae + be$ or $G_2 = G_1 - g2 + h2$. Also,
there are 2 choices for arcs pointing to $f$, therefore, there are 2 choices for $G_1$. Therefore,
there are 4 sequences $\mathbf{G}$ such that $T_d(G_1) = T_a$ and $T_d(G_3) = T_b$.

When $T_d(G_1) = T_a$ and $T_d(G_3) = T_c$: this case is similar to the case in which
$T_d(G_3) = T_b$, therefore, there are 4 sequences such that $T_d(G_1) = T_a$ and $T_d(G_3) = T_c$.

When $T_d(G_1) = T_a$ and $T_d(G_3) = T_d$: in this case, the arcs $df$ and $dh$ must already
be in $G_1$. Thus there are only two sequences $\mathbf{G}$ that are counted depending on whether
$G_2 = G_1 - e1 + f1$ or $G_2 = G_1 - g2 + h2$.

Thus there are 26 choices of $\mathbf{G}$ such that $T_d(G_1) = T_a$, and similarly 26 choices each
for $T_d(G_1) = T_b$, $T_d(G_1) = T_c$, and $T_d(G_1) = T_d$. Therefore, there are 104 sequences $\mathbf{G}$
such that $\mathbf{G} > \mathbf{T}$.

Thus we have verified that $n(\mathbf{G} > \mathbf{T} : P_1)$ and $n(\mathbf{G} > \mathbf{T} : Q_1)$ are unequal for
$\mathbf{T} := (T_1, T_2, T_1)$, implying that $P_1$ and $Q_1$ are distinguished by the probability distribution
they induce on the extant sequences under model RM. □
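
The counts in this proof can be reproduced by brute force. The Python sketch below is a check under an explicit assumption: the parent lists of $P_1$ and $Q_1$ are reconstructed here from the arcs named in the proof (founders $a, b, c, d$; non-founders $e, f, g, h$; extant vertices 1 and 2), since the figures themselves are not reproduced in this text. The function names are introduced for illustration only.

```python
from itertools import product

# Parent lists reconstructed from the arcs named in the proof (the figures are
# not reproduced here, so this encoding is an assumption): founders a, b, c, d;
# non-founders e, f, g, h; extant vertices 1 and 2.
P1 = {"e": "ab", "g": "ab", "f": "cd", "h": "cd", "1": "ef", "2": "gh"}
Q1 = {"e": "ab", "g": "ac", "h": "bd", "f": "cd", "1": "ef", "2": "gh"}

def spanning_forests(parents):
    """Yield every spanning forest as a dict child -> chosen parent."""
    children = sorted(parents)
    for choice in product(*(parents[v] for v in children)):
        yield dict(zip(children, choice))

def ancestors(forest, v):
    """The chain v, parent(v), ... up to a founder."""
    chain = [v]
    while chain[-1] in forest:
        chain.append(forest[chain[-1]])
    return chain

def label(forest):
    """Length of the path joining 1 and 2 through their most recent common
    ancestor, or None if they have no common ancestor.  With X = {1, 2} this
    determines the undirected X-forest up to isomorphism."""
    up1, up2 = ancestors(forest, "1"), ancestors(forest, "2")
    for d1, u in enumerate(up1):
        if u in up2:
            return d1 + up2.index(u)
    return None

def count_sequences(parents, labels):
    """n(G > T : P): sequences of spanning forests with the prescribed labels
    in which consecutive forests differ at exactly one non-founder vertex."""
    forests = list(spanning_forests(parents))
    tags = [label(g) for g in forests]
    adjacent = lambda g, h: sum(g[v] != h[v] for v in g) == 1
    counts = [int(t == labels[0]) for t in tags]
    for wanted in labels[1:]:
        counts = [
            sum(c for c, g in zip(counts, forests) if c and adjacent(g, h))
            if t == wanted else 0
            for h, t in zip(forests, tags)
        ]
    return sum(counts)

T1, T2 = 4, None   # path of length 4; two isolated vertices
for name, ped in (("P1", P1), ("Q1", Q1)):
    print(name, [count_sequences(ped, seq)
                 for seq in ([T1], [T2], [T1, T2], [T1, T2, T1])])
# P1 [16, 48, 64, 112]
# Q1 [16, 48, 64, 104]
```

Under this encoding the two pedigrees agree on the counts for sequences of length 1 and 2 and first differ for $\mathbf{T} = (T_1, T_2, T_1)$, in agreement with the values 112 and 104 obtained above.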
6 Discussion and open questions



In this paper we have presented a rigorous mathematical framework for studying pedigree 
reconstruction problems under probabilistic models. We extended phylogenetic identifi- 
ability results to reconstruct pedigrees under an idealised model of recombination and 
mutation. The main result of this paper is the computation of a class of combinatorial 
invariants from the joint distribution of extant sequences. As a corollary, we were able to 
show that certain known examples of pedigrees that could not be distinguished from path 
lengths alone, could be distinguished by a more detailed analysis of subgraph sequences. 
Here we identify a few open problems and directions for future investigation. 
Identifiability of computationally tractable invariants: The invariants of a pedigree $P$
defined by $n(\mathbf{G} > \mathbf{T} : P)$ may be quite difficult to apply in general. Theorem 5.13 may
be difficult to use computationally for bigger pedigrees. Even for other reconstruction
problems in graph theory, computational verifications are difficult. For example, Ulam's 
reconstruction conjecture has been computationally verified to be true only for graphs on 
at most 11 vertices [9]. Even on restricted classes of graphs, computational reconstruction 
experiments are difficult to perform. It will be useful to derive from n(G > T : P) (or 
independently) other identifiable invariants that may be easier to use in computational 
experiments. 

Theorem 5.13 only states that if two pedigrees induce the same joint distribution on
extant sequences under model RM, then they agree on the invariants $n(\mathbf{G} > \mathbf{T})$. But it
will be important to prove a converse or a result of the type: two pedigrees induce the
same distribution on extant sequences under model RM if and only if they take the same
value for a class of combinatorial invariants. Such a result would reduce the identifiability
problem to a purely combinatorial problem of proving or disproving that the class of
invariants is complete. Such a class of invariants may be $n(\mathbf{G} > \mathbf{T})$ or it may be somewhat
stronger than $n(\mathbf{G} > \mathbf{T})$. It may be possible to compute, by the methods of Theorem 5.13,
other invariants, for example, spanning forest sequences in which consecutive spanning
forests are not necessarily separated by just one recombination.

Improving the bound on p: In the lower bound on $A(\mathbf{T})$ in Equation (5.7), we have a
factor of $1/2^e$, while $c_3$ is roughly $e^m$, where the pedigree has $2e$ arcs.
Therefore, it is possible to obtain better bounds on $p$ for the applicability of the main
theorem for pedigrees with fewer arcs. Therefore, we would significantly improve the upper
bound on $p$ if we showed that a pedigree can be reconstructed from the collection of its
subpedigrees (pedigrees of subsets of the extant population) of order $k$ for some small $k$.
In [15] we made a conjecture about how small $k$ may be for pedigrees of order $n$ in which
the population remains constant over generations. Thus solving the purely combinatorial
problem of reconstructing pedigrees from their subpedigrees, while important in its own
right, will be useful for improving bounds on $p$ in Theorem 5.13.

On the other hand it is also likely (although we do not conjecture) that Theorem 5.13
is valid without restrictions on $p$, or for all but finitely many values of $p$, or for all $p$ except
when $|\Sigma|$ takes small values. But we do not have good intuition as to why Proposition 4.6
(with $|\Sigma| = 2$) requires a more complicated argument and an upper bound on $p$.
Maximum likelihood computation of the invariants $n(\mathbf{G} > \mathbf{T})$: It will be of interest
to derive statistical consistency results and bounds on sequence lengths, analogous to
Corollary 5.3 and Lemma 5.5, for computing the invariants $n(\mathbf{G} > \mathbf{T})$. For example, we
would like to make the following qualitative statement precise. Suppose $P$ is a pedigree
with $e$ arcs. Let $\epsilon > 0$ be given. Suppose $\mathbf{T}$ is an undirected X-forest sequence of length
$m$ with no two consecutive X-forests isomorphic. Then there is a sufficiently large $L_{\epsilon,m}$
such that if a collection of sequences of length $L > L_{\epsilon,m}$ evolved on $P$ giving an alignment
$A$, then the likelihood ratio $L(A \mid P)/L(A \mid Q)$ is large for all pedigrees $Q$ such that
$n(\mathbf{G} > \mathbf{T} : P) \neq n(\mathbf{G} > \mathbf{T} : Q)$. We expect that $L_{\epsilon,m}$ would be of the order of $m \log(1/\epsilon)$.

In the model RM, we assumed that the founders are independently assigned sequences 
from a uniform distribution. This assumption may be relaxed or replaced by more realistic 
assumptions. 
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Nomenclature 

$(\Sigma^{XL} : P, RM(p, \mu))$ the space of alignments on the extant vertices of $P$ as a
probability space under model $RM(p, \mu)$; other probability spaces are denoted analogously

$[a, b]$, $[a, b)$, ... intervals in integers and reals

$[m]$ the set $\{1, 2, \ldots, m\}$

$\cong$ isomorphism between pedigrees, X-forests, graphs, etc.

$\|\mathcal{T}_P\|$, $\|\mathcal{U}_P\|$ the sets of isomorphism classes of directed and
undirected X-forests (or distinct directed or undirected X-forests) in a pedigree $P$, respectively

$\|S\|$ isomorphism class of an object; if $S$ is a class of objects, then it is the set of
isomorphism classes of objects in the class

$\|T\|$ the isomorphism class of an X-forest $T$

$\mathbf{f}(A) := (f_1, f_2, \ldots, f_N)$ vector of fractional site pattern frequencies in an
alignment $A$

$\mathbf{G} := (G_1, G_2, \ldots, G_m)$ a spanning forest sequence of length $m$

$\mathbf{p}(T, \mu) := (p_1, p_2, \ldots, p_N)$ defined as $p_i := \Pr\{C_i \mid T, M(\mu)\}$

$\mathbf{T} := (T_1, T_2, \ldots, T_m)$ an X-forest sequence of length $m$

$\mathcal{A} := \mathcal{A}_1 : \mathcal{A}_2 : \ldots : \mathcal{A}_m$ a set of alignments
obtained by concatenating alignments from sets $\mathcal{A}_i$, $i = 1, 2, \ldots, m$

$\mathcal{A}(T_j, r, L)$ the set of alignments of length $L$ whose fractional site pattern
frequencies are within a radius $r$ from $\mathbf{p}(T_j)$

$\mathcal{A}, \mathcal{A}_j$ subsets of an alignment space such as $\Sigma^{XL}$

$\mathcal{G}_P$ the set of spanning forests in a pedigree $P$

$\mathcal{T}_P, \mathcal{U}_P$ the sets of directed and undirected X-forests in a pedigree $P$,
respectively

$\mu$ the substitution probability in models $RM(p, \mu)$ and $M(\mu)$

$\mathbb{N}$ the set of natural numbers

$\Pr\{\cdot\}$ the probability of an event

$\rho(s, r)$ a ball of radius $r$ centred at $s$

$\Sigma$ finite alphabet

$\Sigma^X$ the set of site patterns on $X$

$\Sigma^{XL}$ the set of alignments of length $L$ on $X$

$\mathbb{Z}$ the set of integers

$\mathbb{Z}^+$ the set of positive integers

$A, A_1, \ldots$ alignments on $X$, i.e., maps from $X$ to $\Sigma^L$ or elements of $\Sigma^{XL}$

$A := A_1 : A_2 : \ldots : A_m$ an alignment obtained by concatenating alignments $A_i$,
$i = 1, 2, \ldots, m$

$C, C_j, \ldots$ characters on $X$, i.e., maps $C : X \to \Sigma$

$d(x, y)$ 1-norm distance between $x$ and $y$

$d^-(u), d^+(u), d(u)$ in-degree, out-degree, degree (or total degree) of a vertex $u$

$G \cong H$ $G$ and $H$ are isomorphic

$G < H$ or $H > G$ when used for graphs (or isomorphism classes of graphs) $G$ and $H$, it
means $G$ is isomorphic to a subgraph of $H$

$G \subseteq H$ or $H \supseteq G$ when used for labelled graphs $G$ and $H$, it means $G$ is a
subgraph of $H$

$G, G_i, \ldots$ spanning forests in a pedigree

$n(\mathbf{G} > \mathbf{T} : P)$ number of sequences $\mathbf{G}$ of spanning forests in $P$ for
which $T_u(G_i) = T_i$ for all $G_i$ in $\mathbf{G}$ and consecutive $G_i$ are separated by
exactly 1 recombination

$p$ the crossover probability in models $R(p)$ and $RM(p, \mu)$

$P, Q, P(X, Y, U, E), \ldots$ pedigrees

$r(G)$ the number of recombinations in $G$, see Definition 2.8

$s(G)$ the number of points of no recombination in $G$, see Definition 2.8

$S^k$ the set of $k$-tuples of elements of a set $S$

$S^X$ the set of all functions from $X$ to $S$

$T, T_j, \ldots$ X-forests in a pedigree or X-forests

$T_d(G)$ the unique directed X-forest in a spanning forest $G$ in a pedigree

$T_u(G)$ the unique undirected X-forest in a spanning forest $G$ in a pedigree

$u < v$ (for vertices $u$ and $v$ in a pedigree) there is a directed path from $v$ to $u$

$V(G), E(G)$ vertex and edge sets of a graph, respectively

$v(G), e(G)$ cardinalities of vertex and edge sets of a graph, respectively

$X$ the set of extant vertices of a pedigree